Abstract —In this correspondence, we illustrate, among other
things, the use of the stationarity property of the set of
capacity-achieving inputs in capacity calculations. In particular, as a
case study, we consider a bit-patterned media recording channel
model and formulate new lower and upper bounds on its capacity
that yield improvements over existing results. Inspired by the
observation that the new bounds are tight at low noise levels, we
also characterize the capacity of this model as a series expansion
in the low-noise regime.

The key to these results is the realization of stationarity in 
the supremizing input set in the capacity formula. While the 
property is prevalent in capacity formulations in the ergodic- 
theoretic literature, we show that this realization is possible in 
the Shannon-theoretic framework where a channel is defined 
as a sequence of finite-dimensional conditional probabilities, by 
defining a new class of consistent stationary and ergodic channels. 

Index Terms —Channel capacity, stationary inputs, stationary 
and ergodic channel, bit-symmetry, bit-patterned media recording,
lower/upper bounds, series expansion.


I. Background 

The fundamental limit of information transmission through
noisy channels, the channel capacity, has been a holy grail in
information theory. The capacity problem of a general
point-to-point channel has been well resolved with the
information-spectrum framework [1]. Such a general formula for the
capacity, however, does not lend itself to computation in general,
since it requires one to scrutinize the distribution of the
information density in the limit of infinite block length. To
overcome this problem, a common approach is to find an 
alternative expression that, instead of being described by an 
information-spectrum quantity, contains a mutual information 
quantity (or entropy quantities). While an expression of this 
kind is not as general, it may cover a sufficiently large class 
of channels for many practical purposes. 
There are two popular forms of such expressions. One is
Dobrushin’s information-stable channel capacity [2]:

$$C = \lim_{n\to\infty} \frac{1}{n} \sup_{X^n} I(X^n; Y^n) \tag{1}$$

where the supremum is over all possible sequences of
distributions $\{X^n\}_{n=1}^{\infty}$. This formula holds for the class
of information-stable channels. A similar formula also appears
in the context of (decomposable or indecomposable) finite-state
channels [3]. The other form swaps the supremum and
the limit in the above formula, with the supremum being taken
over a smaller set of input distributions with special structures.
This type of formula appears in the ergodic-theoretic literature
of information theory. For example, for $\bar{d}$-continuous discrete
stationary and ergodic (SE) two-sided channels, the capacity
was shown to be [4]

$$C = \sup_{\text{stationary } \mu} \lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \tag{2}$$

where $\mu$ is a probability measure that describes the input
process. (See Section I-B for the distinction between the two
sets being supremized over in the above formulas.)

Such capacity formulations that involve supremization over 
stationary inputs are common in the ergodic-theoretic setting, 
where a channel only admits infinitely long input sequences, 
mostly for SE channels with memory and anticipation [5], 
[4]. In the Shannon-theoretic framework, where an admissible 
input is a sequence of finite-dimensional distributions, capacity 
formulations similar to Eq. (1) are, however, more prevalent^1.

Intuitively, constraints on the supremizing input set reduce
the capacity-achieving input search space and may become
useful. A major aim of this paper is to illustrate possible
ways to exploit the stationarity property of the supremizing
input set in capacity calculations, via a case study of a
bit-patterned media recording (BPMR) channel model. This
model was first introduced in [8] and subsequently studied
in [9]. To achieve ultra-high density magnetic recording, in
BPMR technologies, the data write process takes place on a
new magnetic medium comprising magnetic islands that
are separated by non-magnetic materials. The difficulty in
maintaining synchronization between the write head’s position
and the correct island where data is to be written is
captured by a channel model with paired insertion-deletion

^1An exception is the standard insertion-deletion channel, where stationary
and ergodic inputs could achieve capacity [6], [7]. This is a specific channel
model that is beyond the scope of this paper.
errors, governed by a first-order Markov process. The channel
input is the actual data to be written and the output is the data
as written on the islands. This channel model is essentially
a finite-state channel with dependent insertion-deletion (DID)
errors, and is henceforth called the DID channel model.

Both works [8] and [9] analyze the capacity of the DID 
channel using a formula of the form of Eq. (1). In our case 
study, we will contrast this with the use of a formula that 
involves stationary inputs, which is able to yield improvements 
and new results. In particular, the formula we seek assumes a 
form similar to Eq. (2). We consider both Shannon-theoretic 
and ergodic-theoretic frameworks. While we analyze specifically
the DID channel, we believe the techniques we present to
evaluate its capacity could be applicable to similar BPMR
channel models, e.g. the model in [10], and beyond.

We give a high-level view of the calculation techniques. As
in Eq. (1) and (2), computing the capacity generally involves
an infinite-dimensional optimization problem, which in most
cases is intractable. Our idea towards computational feasibility
is to decompose $I(X^n; Y^n)$ into a sum of finite-dimensional
terms of the form

$$I(X^n; Y^n) \approx \sum_{i=1}^{n} f\left([X]_i, [Y]_i, [Z]_i\right)$$

where $[X]_i$ indicates a finite-size group of terms in the
neighborhood of $X_i$ (i.e. $X_{i-a+1}, \ldots, X_{i+b}$ for some
finite non-negative constants $a$ and $b$), and the same for $[Y]_i$
and $[Z]_i$, and $f$ is a function mapping the joint distribution
of $([X]_i, [Y]_i, [Z]_i)$ to $\mathbb{R}_+$ and is independent of the index
$i$. Here $\{Z_i\}$ can be thought of as an external source of
randomness introduced by the channel. The approximation
is to be replaced with an exact inequality (i.e. “$\le$” or
“$\ge$”). Given the finite-dimensional distribution of each input
block $[X]_i$, $f([X]_i, [Y]_i, [Z]_i)$ can be computed. However,
the curse of dimensionality is not yet completely eliminated:
for a general input, we have $n$ possibly different distributions
for $[X]_1, [X]_2, \ldots, [X]_n$.

The problem is resolved when the input is stationary, in
which case we reduce the number of said distributions from $n$
to one, which is the lowest possible. Furthermore, for many
classes of channels, input stationarity implies stationarity of
$\{[X]_i, [Y]_i, [Z]_i\}$. Effectively,

$$\frac{1}{n} I(X^n; Y^n) \approx f\left([X]_1, [Y]_1, [Z]_1\right)$$

which is now computable.

Broadly speaking, input stationarity helps reduce an
infinite-dimensional problem to a finite-dimensional one.
This reduction is potentially useful for constructing upper
bounds on the capacity^2. With a “good” $f$, one may hope
$f([X]_1, [Y]_1, [Z]_1)$ provides a tight upper bound. With a
matching lower bound, the exact capacity can be deduced.
We note that the advantage is not only computational,
but also analytical, since we then only need to work with a
finite number of variables, instead of infinitely many. We shall
illustrate this for the case of the DID channel.

^2One can always restrict attention to stationary inputs to lower-bound the
capacity regardless of whether stationary inputs can achieve the capacity.


For the rest of this section, after introducing some mathematical
conventions and common definitions, we briefly
discuss the two different channel definitions that correspond
to the two frameworks, and their relation to practical channel
models. We then give a brief description of the DID channel
model with known results on its capacity and outline the main
contributions of the paper.

A. Mathematical Conventions 

To denote random variables, uppercase letters (e.g. $V$, $X$,
$Y$) are used, and corresponding lowercase letters (e.g. $v$, $x$,
$y$) are adopted for values they take. Their respective alphabets
are in calligraphic style (e.g. $\mathcal{V}$, $\mathcal{X}$, $\mathcal{Y}$). $V_a^b$ denotes the vector
$(V_a, V_{a+1}, \ldots, V_b)^\top$ for $a \le b$. At times $V^b$ may be used in
place of $V_1^b$ (or $V_0^b$ where appropriate). Whenever the length is
not specified or is implicitly understood, the vector is written
in boldface, e.g. $\mathbf{V}$ (resp. $\mathbf{v}$), in which case $V_a$ (resp. $v_a$)
denotes its $a$-th entry. All vectors are understood to be column
vectors.

The notation $\{V_i\}_{i=1}^{\infty}$ or $\{V_i\}_{i=0}^{\infty}$, depending on the starting
index (or simply $\{V_i\}$), denotes a one-sided random process. We only consider
processes $\{V_i\}$ in which $V_i \in \mathcal{V}\ \forall i$. This restriction is
technically not an issue, since we can always expand the
alphabet of each $V_i$ to the largest one, with an appropriately
modified probability measure that assigns probability 0 to any
event that involves taking values that are not in the original
alphabets.

An event $E_V$ (on a process $\{V_i\}$) is a set of values
$\mathbf{v}$ that the process takes. At times we use notation such as
$\{V_i = a\}$ to refer to the event $\{\mathbf{v} : v_i = a\}$. A similar
meaning applies to e.g. $\{V_i = a \text{ or } V_j = b\}$. We also write e.g.
$\{V_a^{a+b} = v_1 v_2 \ldots v_{b+1}\}$, where $v_1, v_2, \ldots, v_{b+1} \in \mathcal{V}$, to mean
$\{V_a = v_1, V_{a+1} = v_2, \ldots, V_{a+b} = v_{b+1}\}$.

The letter X is used specifically for the channel input, and 
the letter Y is for the channel output. 

The operator $|\cdot|$ returns the length of a vector, e.g. $|V_a^b| =
b - a + 1$, the absolute value of a scalar quantity, or the
size of a set. The expectation operator is $\mathbb{E}[\cdot]$. The binary
entropy function is $h_2(x) = -x \log x - (1 - x) \log(1 - x)$.
We write $[\mathbf{x}, \mathbf{y}]$ to denote the concatenation of vectors $\mathbf{x}$ and
$\mathbf{y}$, i.e. $[\mathbf{x}, \mathbf{y}] = (x_1, \ldots, x_{|\mathbf{x}|}, y_1, \ldots, y_{|\mathbf{y}|})^\top$. When the alphabet
is binary, we use $\neg\mathbf{x}$ to denote the vector obtained by flipping
all bits in $\mathbf{x}$, and $\neg E$ to denote $\{\neg\mathbf{x} : \mathbf{x} \in E\}$.

Throughout this paper, log is understood to be of base 2. 
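
As a small illustration of these conventions, the following Python sketch implements the binary entropy function $h_2$ (with base-2 logarithms) and the bit-flipping operation on vectors and on events. The names `h2`, `flip` and `flip_event` are illustrative choices rather than notation from the paper, and representing an event as a finite set of binary tuples is an assumption made only for this sketch.

```python
import math


def h2(x: float) -> float:
    """Binary entropy in bits; h2(0) = h2(1) = 0 by continuity."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def flip(bits):
    """Bitwise complement of a binary tuple (the vector written as "not x")."""
    return tuple(1 - b for b in bits)


def flip_event(event):
    """Image of an event (a set of binary tuples) under bit flipping."""
    return {flip(x) for x in event}
```

Note that $h_2(x) = h_2(1 - x)$, so the entropy of a binary variable is unchanged by flipping, which is the simplest instance of the bit-symmetry exploited later.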

B. Definitions 

Let $T$ denote the left-shift transformation. That is, e.g. for a
vector $(v_1, v_2, \ldots)^\top$, we have $T(v_1, v_2, \ldots)^\top = (v_2, v_3, \ldots)^\top$.

Also, let $T^{-1}E = \{\mathbf{v} : T\mathbf{v} \in E\}$.

We refer to “finite-dimensional” events as cylinders. That
is, a cylinder $E$ on the process $\{V_i\}$ takes the form
$c_t^m(G) = \{\mathbf{v} : (v_t, v_{t+1}, \ldots, v_{t+m-1}) \in G\}$ for some $G \subseteq
\mathcal{V}^m$. Also, $\mathrm{cylind}(t, m, \mathcal{V}) = \{c_t^m(G) : G \subseteq \mathcal{V}^m\}$ is the set of
all cylinders with “starting index” $t$ and length $m$ over $\mathcal{V}$. For
finite $m$ and $\mathcal{V}$, $\mathrm{cylind}(t, m, \mathcal{V})$ is finite-sized.
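
The shift and cylinder definitions can be mimicked on finite truncations of a sequence. The sketch below is only an illustration under that truncation assumption; the helper names `shift`, `in_cylinder` and `num_cylinders` are hypothetical. It shows that a sequence lies in the shifted cylinder exactly when its shift lies in the original one, and that the number of cylinders with a given starting index and length is finite.

```python
def shift(v):
    """Left shift T: drop the first entry of a (truncated) sequence."""
    return tuple(v[1:])


def in_cylinder(v, t, m, G):
    """True if (v_t, ..., v_{t+m-1}) lies in G, using the 1-based index t."""
    window = tuple(v[t - 1:t - 1 + m])
    return len(window) == m and window in G


def num_cylinders(alphabet_size, m):
    """Number of subsets G of V^m, i.e. the size of cylind(t, m, V)."""
    return 2 ** (alphabet_size ** m)
```

For example, with `v = (0, 1, 1, 0, 1)` and `G = {(1, 1)}`, `in_cylinder(shift(v), 1, 2, G)` and `in_cylinder(v, 2, 2, G)` agree, which reflects that $T^{-1}$ moves the starting index of a cylinder from $t$ to $t+1$.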

A random process $\{V_i\}$, defined over a probability space
$(\Omega, \Sigma, P)$, is said to be stationary if $\forall E \in \Sigma$, $P(E) =
P(T^{-1}E)$. It is $N$-stationary, for some integer $N \ge 1$, if
$\forall E \in \Sigma$, $P(E) = P(T^{-N}E)$. It is ergodic if $P(E) = 0$ or
$P(E) = 1$ for all invariant events^3 $E \in \Sigma$.

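The notion of stationarity can be extended to an $n$-dimensional
probability measure $P^{(n)}$ for finite $n$. That is,

$$P^{(n)}(E \times \mathcal{V}) = P^{(n)}(T^{-1}E) \quad \forall E \subseteq \mathcal{V}^{n-1}$$

It can be observed that if $P$ is stationary and $P^{(n)}$ is an
$n$-dimensional marginal of $P$, i.e. $P^{(n)}(E) = P(\{V_k^{k+n-1} \in E\})$
$\forall E \subseteq \mathcal{V}^n$ for some $k$, then $P^{(n)}$ is stationary.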

We make a note on Kolmogorov consistency. A sequence
$\{V^n\}_{n=1}^{\infty}$ is consistent, where $V^n$ is associated with probability
measure $P^{(n)}$ for each $n$, if

$$\sum_{v_{n+1} \in \mathcal{V}} P^{(n+1)}(v^{n+1}) = P^{(n)}(v^n) \quad \forall v^n \in \mathcal{V}^n$$

for every $n$. That is, a consistent $\{V^n\}_{n=1}^{\infty}$ can be described
by a single probability measure, whereas an inconsistent $\{V^n\}_{n=1}^{\infty}$
is described by an infinite number of probability measures
$\{P^{(n)}\}_{n=1}^{\infty}$. Notice that the notation $\{V_n\}_{n=1}^{\infty}$ implies a standard
random process which must be consistent by definition, while
$\{V^n\}_{n=1}^{\infty}$ is only a sequence in $n$. The two notations coincide if
$\{V^n\}_{n=1}^{\infty}$ is consistent. We also note that if $\{V^n\}_{n=1}^{\infty}$ is stationary,
$N$-stationary or ergodic, it must be specified by a single
probability measure and is therefore consistent.
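
The consistency condition can be checked mechanically for finite-dimensional distributions. The following sketch assumes the standard marginalization form of Kolmogorov consistency (summing the $(n+1)$-dimensional measure over its last coordinate) and a binary alphabet by default; the helpers `iid_family` and `is_consistent` are hypothetical names introduced for the illustration.

```python
from itertools import product


def iid_family(n, q):
    """P^(n) of n i.i.d. Bernoulli(q) bits, as a dict over binary n-tuples."""
    return {v: (q ** sum(v)) * ((1 - q) ** (n - sum(v)))
            for v in product((0, 1), repeat=n)}


def is_consistent(p_n, p_next, n, alphabet=(0, 1), tol=1e-12):
    """Check that summing P^(n+1) over its last coordinate gives P^(n)."""
    for v in product(alphabet, repeat=n):
        marginal = sum(p_next.get(v + (a,), 0.0) for a in alphabet)
        if abs(marginal - p_n.get(v, 0.0)) > tol:
            return False
    return True
```

An i.i.d. family passes the check for every $n$, whereas pairing a uniform one-dimensional distribution with a two-dimensional distribution concentrated on `(0, 0)` fails it; the latter is an example of a sequence of distributions that is admissible as a Shannon-theoretic input but is not a random process.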

1) Channel Definitions: We discuss the channel definition 
in two settings; the ergodic-theoretic framework (see e.g. [5]) 
and the Shannon-theoretic framework (see e.g. [1]). 

• In the ergodic-theoretic setting, a channel is defined as
a list of probability measures $\{\nu_{\mathbf{x}} : \mathbf{x} \in \mathcal{X}^\infty\}$ which act
on output events $E_Y \subseteq \mathcal{Y}^\infty$, where each $\mathbf{x}$ is either a
bi-infinite sequence (i.e. $\mathbf{x} = (\ldots, x_{-1}, x_0, x_1, \ldots)$) or a
uni-infinite sequence (i.e. $\mathbf{x} = (x_k, x_{k+1}, \ldots)$ for some
finite $k \in \mathbb{Z}$). This definition only admits inputs that
are consistent random processes. The joint input-output
distribution $\omega$ for an input process $\mu$ is determined by

$$\omega(E_X, E_Y) = \int_{E_X} \nu_{\mathbf{x}}(E_Y)\, d\mu(\mathbf{x}) \tag{3}$$

for any events $E_X$ and $E_Y$ on the channel input and
output respectively. Here we consider uni-infinite
input sequences only, which correspond to one-sided
processes and one-sided channels. The literature on
bi-infinite inputs (and correspondingly two-sided channels)
is more extensive, but lacks the descriptive power for the
channel model of application in this paper, i.e. the DID
channel.

• The Shannon-theoretic definition of a channel is a sequence
of finite-dimensional conditional probability measures
$\{P_{Y|X^n}(\cdot|\mathbf{x}), \mathbf{x} \in \mathcal{X}^n\}_{n=1}^{\infty}$, in which $Y$ can take
any abstract alphabet. This definition avoids placing any
restrictions on the channel and allows inputs that are not
necessarily consistent and take the form $\{X^n\}_{n=1}^{\infty}$. The joint
input-output distribution is determined for each block
length $n$, i.e.

$$P_{X^n Y}(x^n, y) = P_{X^n}(x^n)\, P_{Y|X^n}(y|x^n) \tag{4}$$

where $X^n \sim P_{X^n}$, and it is not necessarily consistent
throughout all $n$’s.

Compatibility between the two definitions may not be entirely 
immediate. On one hand, the sequence in the Shannon- 
theoretic definition may not be viewed as converging to a 
limit equal to the ergodic-theoretic definition without careful 
justification of such a limit. On the other hand, the lack of
channel laws $P_{Y|X^n}(\cdot|\mathbf{x})$ for finite block lengths $n$ in the
ergodic-theoretic framework has been noted to pose technical
difficulties in proving coding theorems [4]. The difference 
in the channel definition, as noted in [1], also leads to a 
difference in how the error probability is defined. Despite 
this discrepancy, the capacities under the two frameworks 
are of little difference in the overall implication to reliable 
communications in the asymptotic regime of infinite block 
length. Depending on the actual channel model, one may find 
either framework suitable. In general, we shall treat the two 
frameworks separately. 

2) From Channel Definitions to Channel Models: The 
channel definitions abstract away from specific channel models 
and form the basis under which reliable communications is 
defined. Practical channel models, however, are usually not 
described in terms of probability measures as in the channel 
definitions. When the ergodic-theoretic channel definition is
applicable to a channel model, we mean that there exists
a list of probability measures $\{\nu_{\mathbf{x}} : \mathbf{x} \in \mathcal{X}^\infty\}$ such that for
every consistent input, the joint input-output distribution can
be described by Eq. (3). Likewise, when the Shannon-theoretic
channel definition is applicable to a channel model, there
exists a sequence of finite-dimensional conditional probability
measures $\{P_{Y|X^n}(\cdot|\mathbf{x}), \mathbf{x} \in \mathcal{X}^n\}_{n=1}^{\infty}$ such that for every
input $\{X^n\}_{n=1}^{\infty}$, the joint input-output distributions can
be described by Eq. (4).

While the two frameworks associated with the two channel
definitions are handled separately, when both channel definitions
are applicable, the observation that the measures $\nu_{\mathbf{x}}$ and
$P_{Y|X^n}(\cdot|\mathbf{x})$ coexist for a channel model can be exploited.
In particular, in such cases, the joint input-output distribution
can be described either way, thereby allowing certain results
from one framework to be used in the other.

C. DID Channel Model 

The DID channel model is described by

$$Y_i = X_{i - Z_i}$$

where $\{Z_i\}_{i=1}^{\infty}$ is a first-order binary Markov process, independent
of the binary input. Note that the starting indices of
the state process and the output are 1 and that of the input is
0. Here $P(Z_i = 1 | Z_{i-1} = 0) = p_i$, the insertion probability,
and $P(Z_i = 0 | Z_{i-1} = 1) = p_d$, the deletion probability.

An illustration is given in Fig. 1. It can be observed
that every insertion must be followed by a deletion and vice 
versa, i.e. insertions and deletions are paired; hence the name 
dependent insertion-deletion channel. 
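
To make the model concrete, here is a minimal simulation sketch. It assumes the channel relation $Y_i = X_{i - Z_i}$ with $Z_i \in \{0, 1\}$, which is consistent with the description of insertions and deletions in this section, and it assumes by default that $Z_1$ is drawn from the stationary law of the state chain. The function name `simulate_did` and its optional argument `z1` are hypothetical.

```python
import random


def simulate_did(x, p_i, p_d, rng, z1=None):
    """Pass input bits x = [x_0, ..., x_n] through the DID channel.

    Returns (y, z) with y = [y_1, ..., y_n] and z = [z_1, ..., z_n],
    where y_i = x_{i - z_i} (the assumed channel relation).
    """
    n = len(x) - 1
    if z1 is None:
        # Assumption: start Z_1 from the stationary law of the chain.
        z1 = 1 if rng.random() < p_i / (p_i + p_d) else 0
    y, z = [], []
    state = z1
    for i in range(1, n + 1):
        if i > 1:
            if state == 0:
                state = 1 if rng.random() < p_i else 0  # 0 -> 1: insertion
            else:
                state = 0 if rng.random() < p_d else 1  # 1 -> 0: deletion
        z.append(state)
        y.append(x[i - state])
    return y, z
```

While the state stays at 0 the output copies the current input bit, and while it stays at 1 the output copies the previous input bit; a 0 to 1 transition therefore repeats an input bit and a 1 to 0 transition skips one, so the two error types necessarily alternate.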

This channel describes a simplified model for the BPMR 
write process, capturing certain key features of the errors 


^3An invariant event $E$ is defined to satisfy $T^{-1}E = E$.
Fig. 1. An illustration of the DID channel. The circled input bits are deleted,
and the circled output bits are inserted. Each $Z_i = 0 \to Z_{i+1} = 1$ transition
induces an insertion, in which the inserted bit is the same as the last input
bit, whereas each $Z_i = 1 \to Z_{i+1} = 0$ transition induces a deletion.


introduced in this process. As mentioned, the aim of this technology
is to achieve ultra-high storage densities, envisioned at
10 Tb/in^2 and beyond [11]. Conventional magnetic recording
media consist of successive magnetic units, each of which is
written with one data bit, represented by the magnetic state of
the unit. As the density increases, interference from adjacent
units has a more severe effect on the magnetic state, hence
degrading the reliability of the process and placing a limit on
these media. BPMR, as one of the responses to this problem,
separates these units with non-magnetic materials in between.
However, as another challenge with increasing densities, the
write head does not shrink proportionally with the size of
each magnetic unit; in fact, it may span multiple units.
Difficulties thereby arise in controlling the position of the write
head relative to the units, leading to timing mismatches. There
are two typical erroneous scenarios: one is when the write head
lags behind the next unit and fails to write the data bit on this
unit, and the other is when it advances beyond the intended
unit, which is hence written with the past data bit. The first
error is modeled by the transition $Z_{i-1} = 1 \to Z_i = 0$, and
the second error corresponds to $Z_{i-1} = 0 \to Z_i = 1$.

For more details about BPMR, see [12], [11], [13], [14].
A more informative justification of the DID channel model
can be found in [8], [9]. It has been pointed out that this
model does not capture all error types in the BPMR write
process, which motivates another channel model in [10]. They
are, however, similar from an information-theoretic perspective.
Since this paper is concerned with the capacity aspect of these
channels, the simplicity of the DID channel is appealing,
making the model suitable for illustrating our information-theoretic
results and capacity-evaluating techniques.

A number of bounds on the DID channel capacity have
been established in [8] and [9]. In particular, in [8], an
upper bound, given by $1 - p_i p_d/(p_i + p_d)$ and termed the
genie-erasure upper bound, was derived, and a numerical
simulation-based lower bound, which is in fact the achievable rate $C_{\mathrm{iud}}$
with independent and uniformly distributed (i.u.d.) input, was
computed for all channel parameters $p_i$ and $p_d$. The specific
case of $p_d = 1$ was analyzed in [9], which provided a
finite-lettered expression of $C_{\mathrm{iud}}$ and also the genie-erasure upper
bound. However, these upper and lower bounds, as seen in
[8, Fig. 10, 11] and [9, Fig. 2], are relatively far apart. (We
reproduce their bounds, plotted as dashed curves, in Fig. 4
and 6 below, for the convenience of the readers.) We
emphasize that both works relied on the capacity formula for
indecomposable channels^4.
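
As a quick numerical companion to the genie-erasure upper bound $1 - p_i p_d/(p_i + p_d)$, the sketch below evaluates it and also computes the stationary probability of a $0 \to 1$ state transition. That the two quantities are related is a simple arithmetic identity of the two-state chain, offered here as an observation and not as a restatement of the derivation in [8]; both function names are hypothetical.

```python
def genie_erasure_upper_bound(p_i, p_d):
    """Upper bound 1 - p_i p_d / (p_i + p_d), in bits per channel use."""
    return 1.0 - p_i * p_d / (p_i + p_d)


def stationary_insertion_rate(p_i, p_d):
    """P(Z_{k-1} = 0, Z_k = 1) under the stationary state distribution."""
    prob_state_zero = p_d / (p_i + p_d)
    return prob_state_zero * p_i
```

For $p_i = p_d = p$ the bound reduces to $1 - p/2$, so at $p = 0.1$ it equals 0.95 bits per channel use.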

In later sections, we restrict our attention to $p_i \in (0,1)$ and
$p_d \in (0,1)$. A simple continuity argument yields results for
special cases where $p_i \in \{0,1\}$ or $p_d \in \{0,1\}$. Henceforth
we say e.g. $p_d = 1$ to mean $p_d$ is very close to 1.


D. Summary of Contributions and Structure 


As said, a main contribution of this work is the realization 
of the stationarity condition in the supremizing input set. In 
Sections II and III, we explore this in capacity formulations for 
the DID channel model in both ergodic-theoretic and Shannon- 
theoretic frameworks. 

• The ergodic-theoretic literature on such formulation is
vast. However, to the best of our knowledge, the theory
developed for one-sided channels contains solely forward
coding theorems, which establish achievability results on
the rate [5]. Converse theorems, which are concerned with
unachievability, are unfortunately missing^5. In Section
II, we verify that a “capacity” formula is applicable
to the DID channel. This formula encompasses other
known achievable rate formulas for one-sided channels.
For simplicity, we shall refer to this formula as a capacity
formula.

• To realize input stationarity in the Shannon-theoretic
framework, we introduce new definitions of consistent,
stationary and ergodic channels, and prove that SE inputs
can achieve the capacity of the finite-alphabet class of
these channels in Theorem 11. We also verify its applicability
to the DID channel. Note that the formula obtained
here is the true capacity. These are done in Section III.

Further towards the aim of computational suitability, a portion
of these two sections introduces the notion of bit-symmetry
and proves that we can further restrict the input search space
to bit-symmetric inputs in Proposition 4 (and, to a lesser
extent, Proposition 14) when the channel is, in addition,
bit-symmetric.

While it is not conclusive whether the ergodic-theoretic 
formula is the true capacity, it is the same as the one in 
the Shannon-theoretic framework. Subsequent sections are 
devoted to evaluations of this single formula. In Section IV, 
a new lower bound is derived analytically in a finite-lettered 
computable form. A series of new computation-based upper 
bounds is formulated in Section V. These bounds are shown 
to yield improvements over those in [8] and [9] and are tight 
at low noise levels. Inspired by this observation, we then
characterize the DID channel capacity in the low-noise regime
in Section VI for the case $p_i = p_d = p_{id}$, given by


(7 = 1- 


E 

U=i 


2^=+!^^ (2 2 


oipl 


^4Although indecomposable channels are under the Shannon-theoretic
framework, with respect to our ergodic-theoretic capacity formulation in
Section II, the achievable rate $C_{\mathrm{iud}}$ is in fact the same, and the genie-erasure
upper bound can be easily established using the same argument as in [8].

^5On the contrary, the theory for two-sided channels is quite complete. We
point to some representative references [4], [5] for excellent summaries of the
results. In particular, [5] contains results that are applicable to both one-sided
and two-sided channels. References [15], [16], [17], [18], [19] contain results
that are proven in the two-sided setting and correspond to tools used in this
paper.
This characterization is achieved (up to the stated order in $p_{id}$) by
the i.u.d. input. The paper concludes with Section VII.

As shown in [14], in models that closely mimic the actual
BPMR write channel, the occurrence frequency should be
almost the same for both insertion and deletion errors. For
this reason, while our bounds are formulated for any $p_i$ and
$p_d$, we illustrate the numerical evaluation of the DID channel
capacity mostly for the case $p_i = p_d = p_{id}$. It is also shown in
[14] that the occurrence frequency could be as low as 10““^, 
which justifies our interest in the low-noise regime. 

II. Ergodic-Theoretic Capacity Formulation 

In this section, we find a “capacity” formula that involves
stationary inputs for the DID channel under the
ergodic-theoretic framework. As a reminder, this formula yields an
achievable rate, not the true capacity, and we consider
one-sided channels only. Our main reference is the work [5].
We also develop the notion of bit-symmetry in this setting.
Roughly speaking, bit-symmetry for binary inputs is the
property in which the probability of drawing a binary string
$\mathbf{x}$ is equal to the probability of drawing $\neg\mathbf{x}$. Likewise, a
bit-symmetric (binary) channel is one in which the conditional
probability of the output $\mathbf{y}$ given the input $\mathbf{x}$ is equal to that
of $\neg\mathbf{y}$ given $\neg\mathbf{x}$. Intuitively, a bit-symmetric channel should
attain its capacity for some bit-symmetric inputs. We show
that this is true for certain channels.

A. Ergodic-Theoretic Capacity 

We review some definitions under the ergodic-theoretic 
framework. 

• A random process $\{V_n\}_{n=1}^{\infty}$, defined over a probability
space $(\Omega, \Sigma, P)$, is said to be asymptotically mean stationary
(AMS) if $\forall E \in \Sigma$, the limit

$$P_{\mathrm{AMS}}(E) = \lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} P(T^{-i}E)$$

exists. $P_{\mathrm{AMS}}$ is called the stationary mean. A stationary
process is also AMS.

• A channel $\{\nu_{\mathbf{x}} : \mathbf{x} \in \mathcal{X}^\infty\}$ is stationary if
$\nu_{\mathbf{x}}(T^{-1}E_Y) = \nu_{T\mathbf{x}}(E_Y)$ for any output event
$E_Y \subseteq \mathcal{Y}^\infty$. It is known that the joint input-output
distribution $\omega$ (and hence the output distribution
$\eta(E_Y) = \omega(\mathcal{X}^\infty, E_Y)$) is stationary given a stationary
input $\mu$ and a stationary channel [5, Lemma 9.3.1].

• A channel is said to be AMS if, given an AMS input, $\omega$ is
also AMS. A stationary channel is also AMS [5, Lemma
9.3.2].

• A stationary (resp. AMS) channel is ergodic if $\omega$ is
ergodic, given any stationary (resp. AMS) and ergodic
input distribution $\mu$.
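
The AMS definition can be illustrated on the two-state Markov chain that drives the DID channel, started from an arbitrary (non-stationary) initial law. The sketch below evaluates the Cesàro average in the definition for the single event $E = \{Z_1 = 1\}$, for which $P(T^{-i}E) = P(Z_{i+1} = 1)$. It is an illustration for this one event only, not a proof that the process is AMS, and the function name is hypothetical.

```python
def cesaro_mean_prob_one(p_i, p_d, q1, n):
    """Average over k = 1..n of P(Z_k = 1) when P(Z_1 = 1) = q1."""
    q = q1
    total = 0.0
    for _ in range(n):
        total += q
        # One step of the chain: P(0 -> 1) = p_i and P(1 -> 0) = p_d.
        q = (1 - q) * p_i + q * (1 - p_d)
    return total / n
```

With $p_i = 0.2$, $p_d = 0.3$ and the chain started deterministically in state 1, the average approaches $p_i/(p_i + p_d) = 0.4$, the stationary probability of state 1, which plays the role of the stationary mean for this event.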

Let us first consider stationary channels, in which the
developments are more straightforward. In the ergodic-theoretic
literature, coding theorems are established for finite-alphabet
stationary and ergodic (SE) channels with various assumptions
on the channel memory and anticipation. A simple class is the
class of finite-input-memory and causal channels. A channel
is said to have finite input memory if there exists a natural
number $m$ such that for any $n$,

$$\nu_{\mathbf{x}}(T^{-n}E_Y) = \nu_{\tilde{\mathbf{x}}}(T^{-n}E_Y)$$

for any $\mathbf{x}$ and $\tilde{\mathbf{x}}$ such that $x_i = \tilde{x}_i$ for all $i \ge n - m$, and any event $E_Y$.

A channel is causal (or without anticipation) if for any $n$,

$$\nu_{\mathbf{x}}(\{Y^n = y^n\}) = \nu_{\tilde{\mathbf{x}}}(\{Y^n = y^n\})$$

for any $\mathbf{x}$ and $\tilde{\mathbf{x}}$ such that $x^n = \tilde{x}^n$ and any $y^n$.

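For the DID channel, we use $P_{X_0^\infty}$, $P_{X_0^\infty Y_1^\infty}$ and
$P_{Y_1^\infty|X_0^\infty}(\cdot|\mathbf{x})$ in place of $\mu$, $\omega$ and $\nu_{\mathbf{x}}$ respectively. The
model can be described under the ergodic-theoretic channel
definition as follows. For an output sequence $Y_1^\infty = \mathbf{y}$, given
an input sequence $X_0^\infty = \mathbf{x}$, each triplet $(y_i, x_i, x_{i-1})$
corresponding to the pair $(\mathbf{x}, \mathbf{y})$ uniquely determines the
occurrence of one of the four events $\{Z_i = 0\}$, $\{Z_i = 1\}$,
$\{Z_i = 0 \text{ or } Z_i = 1\}$ and $\{Z_i \in \emptyset\}$. Let us denote the
determined event by $\mathcal{E}(Z_i; y_i, x_i, x_{i-1})$. Also, for some output event
$E_Y$, let us define

$$\mathcal{E}(\mathbf{x}, E_Y) = \bigcup_{\mathbf{y} \in E_Y} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1})$$

Then the channel law $P_{Y_1^\infty|X_0^\infty}(\cdot|\mathbf{x})$ is determined by

$$
\begin{aligned}
P_{Y_1^\infty|X_0^\infty}(E_Y|\mathbf{x}) &= \int_{E_Y} dP_{Y_1^\infty|X_0^\infty}(\mathbf{y}|\mathbf{x}) \\
&= \int_{E_Y} dP_Z\left(\bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1})\right) \\
&\overset{(a)}{=} P_Z\left(\mathcal{E}(\mathbf{x}, E_Y)\right)
\end{aligned}
\tag{5}
$$

where $P_Z$ is the probability measure of the $\{Z_i\}_{i=1}^{\infty}$ process.
Step (a) holds for the following reason: $\mathcal{E}(Z_i; y_i, x_i, x_{i-1})$
and $\mathcal{E}(Z_i; y'_i, x_i, x_{i-1})$ are disjoint for $y_i \ne y'_i$, so
$\bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1})$ and $\bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y'_i, x_i, x_{i-1})$ are
disjoint for $\mathbf{y} \ne \mathbf{y}'$.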
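
The event determined by a triplet can be written out explicitly. The sketch below assumes the channel relation $Y_i = X_{i - Z_i}$, so that $Z_i = 0$ is compatible with the triplet when $y_i = x_i$ and $Z_i = 1$ is compatible when $y_i = x_{i-1}$; the function name `determined_event` is hypothetical, and the event is represented by the set of admissible values of $Z_i$.

```python
def determined_event(y_i, x_i, x_prev):
    """Set of values z in {0, 1} compatible with y_i = x_{i - z}.

    The result is one of {0}, {1}, {0, 1} or the empty set, matching
    the four possible events on Z_i.
    """
    candidates = ((0, x_i), (1, x_prev))
    return frozenset(z for z, source_bit in candidates if y_i == source_bit)
```

For instance, `determined_event(1, 1, 0)` is `{0}`, `determined_event(0, 1, 0)` is `{1}`, `determined_event(1, 1, 1)` is `{0, 1}`, and `determined_event(0, 1, 1)` is empty. For fixed $(x_i, x_{i-1})$, the events obtained from $y_i = 0$ and $y_i = 1$ never overlap, which is the disjointness used in step (a).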

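It is easy to see that the DID channel model is finite-input-memory
and causal. Combining [5, Theorem 12.6.1] and [5,
Lemma 12.4.2], the capacity of the DID channel is given by^6

$$
\begin{aligned}
C &= \sup_{\text{stationary } P_{X_0^\infty}} \lim_{n\to\infty} \frac{1}{n} I(X_0^n; Y_1^n) && (6) \\
&= \sup_{\text{stationary } P_{X_0^\infty}} \lim_{n\to\infty} \frac{1}{n} \left[ H(Y_1^n) - H(Y_1^n | X_0^n) \right] && (7) \\
&= \sup_{\text{stationary } P_{X_0^\infty}} \left[ \lim_{n\to\infty} H(Y_n | Y_1^{n-1}) - \lim_{n\to\infty} H(Y_n | Y_1^{n-1}, X_0^n) \right]
\end{aligned}
$$

if the channel can be shown to be SE. The third equation follows from
causality of the channel and the fact that the joint input-output
distribution is stationary. We note that the starting indices in
the formula are chosen to suit the DID channel description.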


^6The statement of [5, Theorem 12.6.1] applies to AMS $\bar{d}$-continuous
channels. Finite-input-memory and causal channels are a special case of
$\bar{d}$-continuous channels. Also, as mentioned, a stationary channel is AMS.
This choice is not critical, however, for the following reason. For any
finite non-negative $a$ and $b$,

$$
\begin{aligned}
I(X_{-b}^{n+a}; Y_1^n) &= I(X_0^n; Y_1^n) + I(X_{-b}^{-1}, X_{n+1}^{n+a}; Y_1^n | X_0^n) \\
&\le I(X_0^n; Y_1^n) + (a + b) \log|\mathcal{X}|
\end{aligned}
$$

and $I(X_{-b}^{n+a}; Y_1^n) \ge I(X_0^n; Y_1^n)$. Then:

$$\lim_{n\to\infty} \frac{1}{n} I(X_0^n; Y_1^n) = \lim_{n\to\infty} \frac{1}{n} I(X_{-b}^{n+a}; Y_1^n)$$

since $(a + b) \log|\mathcal{X}|$ is finite.
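
For very small block lengths, the normalized mutual information $\frac{1}{n} I(X_0^n; Y_1^n)$ inside the limit can be computed by exhaustive enumeration. The sketch below does this for the i.u.d. input, assuming the channel relation $Y_i = X_{i - Z_i}$ and a stationary initialization of the state chain. It is a brute-force illustration only (the cost grows as $4^n$), the finite-$n$ value is not the limit $C_{\mathrm{iud}}$, and the function names are hypothetical.

```python
from itertools import product
from math import log2


def z_path_prob(z, p_i, p_d):
    """Probability of the state path z = (z_1, ..., z_n), stationary start."""
    prob_one = p_i / (p_i + p_d)
    prob = prob_one if z[0] == 1 else 1.0 - prob_one
    for prev, cur in zip(z, z[1:]):
        if prev == 0:
            prob *= p_i if cur == 1 else 1.0 - p_i
        else:
            prob *= p_d if cur == 0 else 1.0 - p_d
    return prob


def iud_information_rate(n, p_i, p_d):
    """(1/n) I(X_0^n; Y_1^n) in bits for an i.u.d. input, by enumeration."""
    px = 0.5 ** (n + 1)  # each input string x_0..x_n is equally likely
    joint = {}
    for z in product((0, 1), repeat=n):
        pz = z_path_prob(z, p_i, p_d)
        for x in product((0, 1), repeat=n + 1):
            # z[i - 1] holds Z_i, so y_i = x_{i - Z_i}.
            y = tuple(x[i - z[i - 1]] for i in range(1, n + 1))
            joint[(x, y)] = joint.get((x, y), 0.0) + px * pz
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0.0) + p
    mi = sum(p * log2(p / (px * py[y]))
             for (x, y), p in joint.items() if p > 0.0)
    return mi / n
```

Calling `iud_information_rate(4, 0.1, 0.1)` enumerates 16 state paths and 32 input strings. Since the mutual information is at most $H(Y_1^n) \le n$ bits, the returned rate always lies between 0 and 1.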

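We now argue that the DID channel is stationary under the
condition that $\{Z_i\}_{i=1}^{\infty}$ is stationary, which can be achieved
with a suitable initialization: $P(Z_1 = 0) = p_d/(p_i + p_d)$ and
$P(Z_1 = 1) = p_i/(p_i + p_d)$. For any output event $E_Y$,

$$
\begin{aligned}
P_{Y_1^\infty|X_0^\infty}(E_Y | T\mathbf{x}) &= P_Z\left(\mathcal{E}(T\mathbf{x}, E_Y)\right) \\
&\overset{(a)}{=} P_Z\left(T^{-1}\mathcal{E}(T\mathbf{x}, E_Y)\right) \\
&= P_Z\left(\bigcup_{\mathbf{y} \in E_Y} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_{i+1}; y_i, x_{i+1}, x_i)\right) \\
&= P_Z\left(\bigcup_{\mathbf{y} \in E_Y} \bigcap_{i=2}^{\infty} \mathcal{E}(Z_i; y_{i-1}, x_i, x_{i-1})\right) \\
&= P_Z\left(\bigcup_{\mathbf{y} \in T^{-1}E_Y} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1})\right) \\
&= P_{Y_1^\infty|X_0^\infty}(T^{-1}E_Y | \mathbf{x})
\end{aligned}
$$

where (a) is because $P_Z$ is stationary. Stationarity of the DID
channel is thus proven.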
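
The stationary initialization used in this argument can be verified directly: the law $(p_d/(p_i + p_d),\, p_i/(p_i + p_d))$ is left unchanged by one step of the state chain. The sketch below checks this invariance numerically; the function names are hypothetical, and the transition probabilities are those given in Section I-C.

```python
def stationary_distribution(p_i, p_d):
    """Return (P(Z = 0), P(Z = 1)) for the stationary state chain."""
    total = p_i + p_d
    return (p_d / total, p_i / total)


def step(dist, p_i, p_d):
    """Push a law (q0, q1) on the state through one transition of the chain."""
    q0, q1 = dist
    return (q0 * (1 - p_i) + q1 * p_d, q0 * p_i + q1 * (1 - p_d))
```

For $p_i = 0.2$ and $p_d = 0.3$, the stationary law is $(0.6, 0.4)$ and `step` returns the same pair up to rounding, whereas a start at $(1, 0)$ moves to $(0.8, 0.2)$ after one step.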

To prove ergodicity, we refer to the following definition.
A channel is said to be output strongly mixing if for every
$\mathbf{x} \in \mathcal{X}^\infty$ and cylinders $F_1$ and $F_2$ on the output,

$$\lim_{n\to\infty} \left| \nu_{\mathbf{x}}(T^{-n}F_1 \cap F_2) - \nu_{\mathbf{x}}(T^{-n}F_1)\, \nu_{\mathbf{x}}(F_2) \right| = 0$$

Lemma 1. A stationary channel is ergodic if it is output
strongly mixing.


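Lemma 1 is an easy consequence of [5, Lemma 9.4.3]. Now
for any two cylinders $F_1$ and $F_2$ on the output, for sufficiently
large $n$, we have:

$$
\begin{aligned}
&\left| P_{Y_1^\infty|X_0^\infty}(T^{-n}F_1 \cap F_2 | \mathbf{x}) - P_{Y_1^\infty|X_0^\infty}(T^{-n}F_1 | \mathbf{x})\, P_{Y_1^\infty|X_0^\infty}(F_2 | \mathbf{x}) \right| \\
&\quad = \left| P_Z\left(\mathcal{E}(\mathbf{x}, T^{-n}F_1 \cap F_2)\right) - P_Z\left(\mathcal{E}(\mathbf{x}, T^{-n}F_1)\right) P_Z\left(\mathcal{E}(\mathbf{x}, F_2)\right) \right| \\
&\quad \overset{(a)}{=} \left| P_Z\left(T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1) \cap \mathcal{E}(\mathbf{x}, F_2)\right) - P_Z\left(T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1)\right) P_Z\left(\mathcal{E}(\mathbf{x}, F_2)\right) \right| \\
&\quad \to 0 \text{ as } n \to \infty
\end{aligned}
$$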

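To see (a), suppose $F_1$ and $F_2$ take the forms $c_{t_1}^{m_1}(G_1)$ and
$c_{t_2}^{m_2}(G_2)$ for, respectively, some $G_1 \subseteq \mathcal{Y}^{m_1}$ and $G_2 \subseteq \mathcal{Y}^{m_2}$.
Then for $n > t_2 + m_2 - t_1$,

$$
\begin{aligned}
\mathcal{E}(\mathbf{x}, T^{-n}F_1 \cap F_2)
&= \bigcup_{\mathbf{y} \in T^{-n}F_1 \cap F_2} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1}) \\
&= \left[ \bigcup_{\mathbf{y} \in G_1} \bigcap_{i=n+t_1}^{n+t_1+m_1-1} \mathcal{E}(Z_i; y_{i-n-t_1+1}, x_i, x_{i-1}) \right]
\cap \left[ \bigcup_{\mathbf{y} \in G_2} \bigcap_{i=t_2}^{t_2+m_2-1} \mathcal{E}(Z_i; y_{i-t_2+1}, x_i, x_{i-1}) \right] \\
&= T^{-n} \left[ \bigcup_{\mathbf{y} \in G_1} \bigcap_{i=t_1}^{t_1+m_1-1} \mathcal{E}(Z_i; y_{i-t_1+1}, x_{n+i}, x_{n+i-1}) \right]
\cap \left[ \bigcup_{\mathbf{y} \in G_2} \bigcap_{i=t_2}^{t_2+m_2-1} \mathcal{E}(Z_i; y_{i-t_2+1}, x_i, x_{i-1}) \right] \\
&= T^{-n} \left[ \bigcup_{\mathbf{y} \in F_1} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_{n+i}, x_{n+i-1}) \right]
\cap \left[ \bigcup_{\mathbf{y} \in F_2} \bigcap_{i=1}^{\infty} \mathcal{E}(Z_i; y_i, x_i, x_{i-1}) \right] \\
&= T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1) \cap \mathcal{E}(\mathbf{x}, F_2)
\end{aligned}
$$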


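Replacing $F_2$ by $\mathcal{Y}^\infty$ in the above, we also obtain
$\mathcal{E}(\mathbf{x}, T^{-n}F_1) = T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1)$; hence step (a) is shown.
The convergence with $n \to \infty$ in the last step is justified
as follows. Under the aforementioned distribution of $Z_1$, the
process $\{Z_i\}_{i=1}^{\infty}$ is mixing [20], i.e. for any events $E_1$ and
$E_2$ and any $\epsilon > 0$, there exists a finite $n_\epsilon(E_1, E_2)$ such that
$\forall n > n_\epsilon(E_1, E_2)$,

$$\left| P_Z(T^{-n}E_1 \cap E_2) - P_Z(T^{-n}E_1)\, P_Z(E_2) \right| < \epsilon$$

With $F_1 = c_{t_1}^{m_1}(G_1)$ and $F_2 = c_{t_2}^{m_2}(G_2)$ as above, let

$$N_\epsilon = \left\{ n_\epsilon(E_1, E_2) : E_1 \in \mathrm{cylind}(t_1, m_1, \mathcal{Z}),\ E_2 \in \mathrm{cylind}(t_2, m_2, \mathcal{Z}) \right\}$$

which is a finite set, since $\mathrm{cylind}(t_1, m_1, \mathcal{Z})$ and
$\mathrm{cylind}(t_2, m_2, \mathcal{Z})$ are finite. Hence $\max N_\epsilon$ exists and
is finite. Then:

$$\left| P_Z\left(T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1) \cap \mathcal{E}(\mathbf{x}, F_2)\right) - P_Z\left(T^{-n}\mathcal{E}(T^n\mathbf{x}, F_1)\right) P_Z\left(\mathcal{E}(\mathbf{x}, F_2)\right) \right| < \epsilon \quad \forall n > \max N_\epsilon$$

since $\mathcal{E}(T^n\mathbf{x}, F_1) \in \mathrm{cylind}(t_1, m_1, \mathcal{Z})$ and $\mathcal{E}(\mathbf{x}, F_2) \in
\mathrm{cylind}(t_2, m_2, \mathcal{Z})$. This completes the establishment of the
DID channel’s ergodicity.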
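
The mixing property of the state chain can be observed numerically for the simplest events. The sketch below computes, under the stationary initialization, the largest gap $|P(Z_1 = a, Z_{1+n} = b) - P(Z_1 = a) P(Z_{1+n} = b)|$ over $a, b \in \{0, 1\}$. It covers single-coordinate events only and is not a substitute for the general result cited from [20]; the function names are hypothetical.

```python
def n_step_matrix(p_i, p_d, n):
    """n-step transition matrix of the two-state chain, by repeated products."""
    one_step = [[1 - p_i, p_i], [p_d, 1 - p_d]]
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = [[sum(result[a][k] * one_step[k][b] for k in range(2))
                   for b in range(2)] for a in range(2)]
    return result


def mixing_gap(p_i, p_d, n):
    """Max over a, b of |P(Z_1=a, Z_{1+n}=b) - P(Z_1=a) P(Z_{1+n}=b)|."""
    total = p_i + p_d
    pi = (p_d / total, p_i / total)
    m = n_step_matrix(p_i, p_d, n)
    return max(abs(pi[a] * m[a][b] - pi[a] * pi[b])
               for a in range(2) for b in range(2))
```

For a two-state chain the gap decays geometrically at rate $|1 - p_i - p_d|$, so with $p_i = 0.2$ and $p_d = 0.3$ it halves with every additional step of separation.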

With other initializations, can we still achieve a rate equal 
to that in Eq. (7)? The answer is positive, even though the 
DID channel turns out to be AMS in this case. This is 
shown in Appendix A, which is an interesting application of 
the connection between the ergodic-theoretic and Shannon- 
theoretic frameworks. 
B. Bit-Symmetric Channels: Ergodic-Theoretic Setting

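Definition 2. A binary probability measure $\mu$ (i.e. one that
acts on $GF(2)^\infty$) is bit-symmetric if $\mu(E) = \mu(\neg E)$ $\forall E \subseteq
GF(2)^\infty$.

Definition 3. A binary channel, in which $X_i, Y_i \in
GF(2)$, is bit-symmetric if $\nu_{\mathbf{x}}(E_Y) = \nu_{\neg\mathbf{x}}(\neg E_Y)$ $\forall \mathbf{x} \in
GF(2)^\infty$, $E_Y \subseteq GF(2)^\infty$.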


Proposition 4. For any bit-symmetric binary channel,

$$\sup_{\text{stationary } \mu} \lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n) = \sup_{\substack{\text{bit-symmetric,} \\ \text{stationary } \mu}} \lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n)$$

As before, as long as the starting indices are finite and the
ending indices are within finite differences from $n$, they do
not affect the result.

Proof: The complete proof is given in Appendix C. The 
main idea is to dehne an input distribution /tq based on a 
stationary input p, such that po {E) = {fJ.{E) -f fj,{^E)) /2, 
then prove that the mutual information corresponding to po is 
at least that of p. Also, po can be shown to be stationary and 
bit-symmetric. ■ 
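The symmetrization step $\mu_0(E) = (\mu(E) + \mu(\neg E))/2$ used in the proof can be sketched on a finite-dimensional marginal. This Python snippet is a minimal illustration, assuming a distribution stored as a dictionary over binary tuples; the helper names are hypothetical, and the sketch checks only bit-symmetry of $\mu_0$, not the stationarity or mutual-information claims of the proof.

```python
from itertools import product


def complement(x):
    """Bitwise complement of a binary tuple."""
    return tuple(1 - b for b in x)


def symmetrize(mu):
    """mu0(x) = (mu(x) + mu(not x)) / 2 for every binary tuple x."""
    n = len(next(iter(mu)))
    return {x: 0.5 * (mu.get(x, 0.0) + mu.get(complement(x), 0.0))
            for x in product((0, 1), repeat=n)}


def is_bit_symmetric(mu, tol=1e-12):
    """True if mu(x) = mu(not x) for every x in the support table."""
    return all(abs(p - mu.get(complement(x), 0.0)) <= tol
               for x, p in mu.items())
```

For example, a distribution that is not bit-symmetric becomes bit-symmetric after `symmetrize`, and the total mass is preserved.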

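It is easy to see that the DID channel is bit-symmetric. Then from the above proposition,

$$C = \sup_{\substack{\text{Bit-symmetric,} \\ \text{stationary } \mu}} \left[ \lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right) - \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right) \right] \qquad (8)$$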


This equation is pivotal to subsequent calculations of the DID 
channel capacity. 

The implication of the proposition is broader: the capacity of a bit-symmetric channel whose form is similar to that of the DID channel can be attained with some bit-symmetric inputs. Although, as shall be seen in later sections, the bit-symmetry condition may not be as helpful as the stationarity condition in tightening capacity bounds, it offers an analytical advantage by halving the number of variables representing an input.


C. Two-sided Channels 

Although the DID channel is naturally modeled as a one-sided channel, we make a note here on the applicability of the theory for two-sided channels. In this setting, all processes are two-sided. Roughly speaking, this means that the starting index of the processes is $-\infty$. The difficulty of casting the DID channel as a two-sided one is that the state process $\{Z_i\}$ has an initialization.

Consider a different model of the state process: $\{Z_i\}$ (or more explicitly, $\{Z_i\}_{i=-\infty}^{\infty}$) is a (two-sided) stationary and ergodic binary first-order Markov process, in which $P(Z_i = 1 | Z_{i-1} = 0) = p_i$ and $P(Z_i = 0 | Z_{i-1} = 1) = p_d$. This implies that for any $i \in \mathbb{Z}$, $P(Z_i = 0) = p_d/(p_i + p_d)$ and $P(Z_i = 1) = p_i/(p_i + p_d)$. Then the DID channel falls under the two-sided setting:

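$$P_{Y|X_{-\infty}^{\infty}}(E_y | x) = P_Z\left( \bigcup_{y \in E_y} \bigcap_{i=-\infty}^{\infty} \mathcal{L}\left(Z_i | y_i, x_i, x_{i-1}\right) \right)$$

where $x$ is a bi-infinite input sequence and $E_y$ is a subset of the space of all bi-infinite sequences $(\ldots, y_{-1}, y_0, y_1, \ldots)$, $y_i \in \mathcal{Y}$. All properties we have shown in the one-sided case (stationarity, ergodicity, bit-symmetry) can be proven in a similar fashion. Its true capacity is given by [4]

$$C = \sup_{\text{Stationary } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y_1^n\right) = \sup_{\substack{\text{Bit-symmetric,} \\ \text{stationary } \mu}} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y_1^n\right)$$

which is essentially the same as Eq. (6) and (8).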
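To make the stationary state process and its action on the input concrete, the following Python sketch samples a finite window of $\{Z_i\}$ from its stationary distribution and applies it to an input block. It assumes the DID output rule $Y_i = X_{i - Z_i}$ (that is, $Y_i = X_i$ when $Z_i = 0$ and $Y_i = X_{i-1}$ when $Z_i = 1$), which is defined outside this section; the function names are hypothetical.

```python
import random


def sample_states(n, p_i, p_d, rng):
    """Sample Z_1..Z_n, with Z_1 drawn from the stationary distribution."""
    pi1 = p_i / (p_i + p_d)          # P(Z = 1) in the stationary regime
    z = [1 if rng.random() < pi1 else 0]
    for _ in range(n - 1):
        if z[-1] == 0:
            z.append(1 if rng.random() < p_i else 0)   # 0 -> 1 w.p. p_i
        else:
            z.append(0 if rng.random() < p_d else 1)   # 1 -> 0 w.p. p_d
    return z


def did_channel(x, p_i, p_d, rng):
    """x = [x_0, ..., x_n]; returns [y_1, ..., y_n] with y_i = x_{i - z_i}."""
    n = len(x) - 1
    z = sample_states(n, p_i, p_d, rng)
    return [x[i - z[i - 1]] for i in range(1, n + 1)]
```

Two degenerate settings give deterministic outputs and serve as a sanity check: with $p_i = 0$ the state is always 0 and the output is the input itself, while with $p_d = 0$ the state is always 1 and the output is the input delayed by one position.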

III. Shannon-Theoretic Capacity Formulation 
In this section, we depart from the ergodic-theoretic setting and formulate a capacity formula that involves input stationarity in the Shannon-theoretic framework. To this end, we define a new class of consistent, stationary and ergodic channels. As in the previous section, we also introduce the notion of bit-symmetry under the Shannon-theoretic framework with a similar result. We verify that the results are applicable to the DID channel model.


A. Consistent, Stationary and Ergodic Channels 

Consider the following channel definition. A channel is specified by a sequence of conditional probabilities $\left\{ P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\cdot | x),\ x \in \mathcal{X}^{n+\Delta_++\Delta_-} \right\}_{n=1}^{\infty}$, in which $\Delta_-$ and $\Delta_+$ are two finite non-negative integer constants specified by the specific channel model. Here $P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\cdot | x)$ is a (finite-dimensional) probability measure on $\mathcal{Y}^n$ and hence can admit any events on $Y^n$ (e.g. $\{Y_k^n = y\}$ for $1 \le k \le n$). As usual, $P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(y | x) = P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\{Y^n = y\} | x)$.

Also, the channel admits input sequences $\left\{ X_{1-\Delta_-}^{n+\Delta_+} : X_{1-\Delta_-}^{n+\Delta_+} \sim P^{(n)} \right\}_{n=1}^{\infty}$ that are not necessarily consistent. In this context, $P^{(n)}$ is understood to be $(n+\Delta_++\Delta_-)$-dimensional. This channel definition can be easily seen to be a subclass of the Shannon-theoretic definition in Section I-B. It allows us to focus on channel models whose operation is described for each length-$n$ output sequence $Y^n$ (of starting index 1) given an input sequence $X_{1-\Delta_-}^{n+\Delta_+}$. In the case of the DID channel, $\Delta_- = 1$ and $\Delta_+ = 0$.

We expect that capacity-achieving inputs would be stationary for channels that behave similarly to those SE channels of the ergodic-theoretic setting. This motivates the following definitions.


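Definition 5. A channel is consistent if for every consistent input sequence $\left\{ X_{1-\Delta_-}^{n+\Delta_+} \right\}_{n=1}^{\infty}$, the channel induces a consistent joint input-output sequence $\left\{ X_{1-\Delta_-}^{n+\Delta_+}, Y^n \right\}_{n=1}^{\infty}$.

Definition 6. A consistent channel is weakly stationary if for any $N \ge 1$, any $N$-stationary input distribution $P_{X_{1-\Delta_-}^{\infty}}$ (or $N$-stationary input sequence $\left\{ X_{1-\Delta_-}^{n+\Delta_+} \right\}_{n=1}^{\infty}$) induces a joint input-output sequence $\left\{ X_{1-\Delta_-}^{n+\Delta_+}, Y^n \right\}_{n=1}^{\infty}$ that is $N$-stationary.

A consistent weakly stationary channel is ergodic if any SE input distribution induces a joint input-output sequence that is ergodic.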

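By defining a consistent channel, we legitimize our definition of weakly stationary and ergodic channels. It can be seen that for a consistent weakly stationary channel, if the finite-dimensional input $X_{1-\Delta_-}^{n+\Delta_+}$ satisfies the stationarity condition, i.e.

$$P_{X_{1-\Delta_-}^{n+\Delta_+}}(E \times \mathcal{X}) = P_{X_{1-\Delta_-}^{n+\Delta_+}}\left(T^{-1}E\right) \quad \forall E \subseteq \mathcal{X}^{n+\Delta_++\Delta_--1},$$

the corresponding finite-dimensional input-output pair $\left(X_{1-\Delta_-}^{n+\Delta_+}, Y^n\right)$ is stationary as well, since one can easily construct a stationary input $P_{X_{1-\Delta_-}^{\infty}}$ that retains the same marginal distribution on $X_{1-\Delta_-}^{n+\Delta_+}$. Therefore our definition of weakly stationary channels captures the stationary behavior for every block length. This is however not the case for ergodic channels, since ergodicity cannot be described by finite-dimensional measures. Our definition of ergodic channels consequently remains almost the same as the ergodic-theoretic definition.

Definition 7. A channel is stationary if for any $n, k \ge 1$,

$$P^{(n+k)}_{Y^{n+k} | X_{1-\Delta_-}^{n+k+\Delta_+}}\left(\{Y_{k+1}^{n+k} = y\} \,|\, [x', x]\right) = P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(y | x)$$

for any $x \in \mathcal{X}^{n+\Delta_++\Delta_-}$, $x' \in \mathcal{X}^{k}$, $y \in \mathcal{Y}^n$.

Our definitions of stationary and weakly stationary channels can be viewed as Shannon-theoretic counterparts of, respectively, the classical and general definitions of the ergodic-theoretic stationary channel in [5], [19]. The following two lemmas are useful facts concerning these channels.

Lemma 8. A sequence $\left\{ P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\cdot | x) \right\}_{n=1}^{\infty}$ forms a consistent channel if $\forall x_{1-\Delta_-}^{n+1+\Delta_+}, y^n$, for every $n \ge 1$, there exists a probability measure $\Phi_{n+1}\left(\cdot\,; x_{1-\Delta_-}^{n+1+\Delta_+}, y^n\right)$ on $\mathcal{Y}_{n+1}$ such that

$$P^{(n+1)}_{Y^{n+1} | X_{1-\Delta_-}^{n+1+\Delta_+}}\left(y^{n+1} | x_{1-\Delta_-}^{n+1+\Delta_+}\right) = \Phi_{n+1}\left(y_{n+1}; x_{1-\Delta_-}^{n+1+\Delta_+}, y^n\right) P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}\left(y^n | x_{1-\Delta_-}^{n+\Delta_+}\right)$$

for any $y_{n+1}$.

Proof: For any consistent $\left\{ X_{1-\Delta_-}^{n+\Delta_+} \right\}_{n=1}^{\infty}$,

$$\begin{aligned}
\sum_{x_{n+1+\Delta_+},\, y_{n+1}} P^{(n+1)}_{X_{1-\Delta_-}^{n+1+\Delta_+}, Y^{n+1}}\left(x_{1-\Delta_-}^{n+1+\Delta_+}, y^{n+1}\right)
&= \sum_{x_{n+1+\Delta_+}} \sum_{y_{n+1}} P^{(n+1)}_{Y^{n+1} | X_{1-\Delta_-}^{n+1+\Delta_+}}\left(y^{n+1} | x_{1-\Delta_-}^{n+1+\Delta_+}\right) P^{(n+1)}\left(x_{1-\Delta_-}^{n+1+\Delta_+}\right) \\
&\overset{(a)}{=} \sum_{x_{n+1+\Delta_+}} P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}\left(y^n | x_{1-\Delta_-}^{n+\Delta_+}\right) P^{(n+1)}\left(x_{1-\Delta_-}^{n+1+\Delta_+}\right) \\
&\overset{(b)}{=} P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}\left(y^n | x_{1-\Delta_-}^{n+\Delta_+}\right) P^{(n)}\left(x_{1-\Delta_-}^{n+\Delta_+}\right) \\
&= P^{(n)}_{X_{1-\Delta_-}^{n+\Delta_+}, Y^{n}}\left(x_{1-\Delta_-}^{n+\Delta_+}, y^{n}\right)
\end{aligned}$$

$\forall x_{1-\Delta_-}^{n+\Delta_+}, y^n$, for every $n \ge 1$, where (a) is because $\sum_{y_{n+1}} \Phi_{n+1}\left(y_{n+1}; x_{1-\Delta_-}^{n+1+\Delta_+}, y^n\right) = 1$, and (b) is because

$$\sum_{x_{n+1+\Delta_+}} P^{(n+1)}\left(x_{1-\Delta_-}^{n+1+\Delta_+}\right) = P^{(n)}\left(x_{1-\Delta_-}^{n+\Delta_+}\right)$$

thanks to consistency of the input. ■

Lemma 9. For a consistent channel, if it is stationary, it is also weakly stationary.

Proof: For any $k$, consider a $k$-stationary input $P_{X_{1-\Delta_-}^{\infty}}$, inducing a joint input-output distribution $P_{X_{1-\Delta_-}^{\infty}, Y}$. For any $n$ and any $x \in \mathcal{X}^{n+\Delta_++\Delta_-}$, $y \in \mathcal{Y}^n$,

$$\begin{aligned}
P_{X_{1-\Delta_-}^{\infty}, Y}\left(\left\{X_{k+1-\Delta_-}^{n+k+\Delta_+} = x,\ Y_{k+1}^{n+k} = y\right\}\right)
&= \sum_{x' \in \mathcal{X}^k} P^{(n+k)}_{Y^{n+k} | X_{1-\Delta_-}^{n+k+\Delta_+}}\left(\{Y_{k+1}^{n+k} = y\} \,|\, [x', x]\right) P_{X_{1-\Delta_-}^{\infty}}\left(\left\{X_{1-\Delta_-}^{n+k+\Delta_+} = [x', x]\right\}\right) \\
&= \sum_{x' \in \mathcal{X}^k} P^{(n)}_{Y^{n} | X_{1-\Delta_-}^{n+\Delta_+}}(y | x)\, P_{X_{1-\Delta_-}^{\infty}}\left(\left\{X_{1-\Delta_-}^{n+k+\Delta_+} = [x', x]\right\}\right) \\
&= P^{(n)}_{Y^{n} | X_{1-\Delta_-}^{n+\Delta_+}}(y | x)\, P_{X_{1-\Delta_-}^{\infty}}\left(\left\{X_{k+1-\Delta_-}^{n+k+\Delta_+} = x\right\}\right) \\
&\overset{(a)}{=} P^{(n)}_{Y^{n} | X_{1-\Delta_-}^{n+\Delta_+}}(y | x)\, P_{X_{1-\Delta_-}^{\infty}}\left(\left\{X_{1-\Delta_-}^{n+\Delta_+} = x\right\}\right) \\
&= P_{X_{1-\Delta_-}^{\infty}, Y}\left(\left\{X_{1-\Delta_-}^{n+\Delta_+} = x,\ Y^{n} = y\right\}\right)
\end{aligned}$$

where (a) is because the input is $k$-stationary. Hence $P_{X_{1-\Delta_-}^{\infty}, Y}$ is also $k$-stationary. ■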

B. Capacity Theorem 
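Let

$$i^{(n)}\left(x_{1-\Delta_-}^{n+\Delta_+}; y^n\right) = \log \frac{P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}\left(y^n | x_{1-\Delta_-}^{n+\Delta_+}\right)}{\sum_{\tilde{x}} P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}\left(y^n | \tilde{x}\right) P^{(n)}\left(\tilde{x}\right)}$$

be the information density of $x_{1-\Delta_-}^{n+\Delta_+}$ and $y^n$ under the input $\left\{ X_{1-\Delta_-}^{n+\Delta_+} \sim P^{(n)} \right\}_{n=1}^{\infty}$. The mutual information is denoted by

$$I^{(n)}\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) = \mathbb{E}\left[ i^{(n)}\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) \right]$$

The superscript $(n)$ implies the calculation is w.r.t. $P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}$ and $P^{(n)}$, and it is dropped without ambiguity when the input and the channel are both consistent.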
With respect to the channel definition stated previously, define the following capacity formula:

$$C = \sup_{X} \underline{I}(X; Y)$$

where

$$\underline{I}(X; Y) = \sup \left\{ a \in \mathbb{R} : \lim_{n\to\infty} \Pr\left[ \frac{1}{n} i^{(n)}\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) \le a \right] = 0 \right\}$$

It is still an open question whether there is only one input that attains the supremum in $C$, so we generally do not assume such. As shown by Verdú and Han in [1], the above capacity is equal to the operational capacity, which holds with full generality for any point-to-point channel under the considered definition. Suppose that we restrict our attention to the class of SE inputs. We then have the following definition.

Definition 10. The stationary and ergodic capacity of a channel is

$$C_{SE} = \sup_{X} \underline{I}(X; Y)$$

where the supremum is taken over all SE inputs.

Theorem 11. The capacity of a consistent SE channel with finite alphabets is

$$C = C_{SE} = \sup_{\text{SE } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right)$$

i.e. SE inputs can achieve its capacity.

Proof: The proof is a combination of ergodic-theoretic techniques, information-spectrum results and manipulations on information-theoretic quantities. We provide a sketch here; the complete proof is given in Appendix B. We first establish the second equality via the Shannon-McMillan-Breiman theorem, in which the information density normalized by $n$ converges to the mutual information rate. To show that $C = C_{SE}$, note that $C_{SE} \le C$ trivially. We then have to show that the mutual information rate quantity is an upper bound on $C$. To do so, we use a technique in [15] to construct an SE input $\mu$ from an arbitrary finite-dimensional distribution on $\mathcal{X}^{r+\Delta_++\Delta_-}$ for some $r \in \mathbb{N}^+$. We show that

$$\lim_{n\to\infty} \frac{1}{n} I_\mu\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) \ge \frac{1}{r + \Delta_+ + \Delta_-} I^{(r)}\left(X_{1-\Delta_-}^{r+\Delta_+}; Y^r\right)$$

where $I_\mu$ is the mutual information under $\mu$. The right-hand side term is an upper bound on $\underline{I}(X; Y)$ in the limit $r \to \infty$. Since $\mu$ is SE, for any such $\mu$,

$$C_{SE} \ge \lim_{n\to\infty} \frac{1}{n} I_\mu\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right)$$

which completes the proof. ■

C. Bit-Symmetric Channels: Shannon-Theoretic Setting

Definition 12. A binary input $\left\{ X_{1-\Delta_-}^{n+\Delta_+} \right\}_{n=1}^{\infty}$, where $X_{1-\Delta_-}^{n+\Delta_+} \sim P^{(n)}$, is bit-symmetric if for any $n \ge 1$, $P^{(n)}(x) = P^{(n)}(\neg x)$ $\forall x \in \mathrm{GF}(2)^{n+\Delta_++\Delta_-}$.

Definition 13. A binary channel $\left\{ P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\cdot | x) \right\}_{n=1}^{\infty}$ is bit-symmetric if for any $n \ge 1$, $P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(y | x) = P^{(n)}_{Y^n | X_{1-\Delta_-}^{n+\Delta_+}}(\neg y | \neg x)$ for all $x$ and $y$.

Proposition 14. For any bit-symmetric binary channel,

$$\sup_{\left\{P^{(n)}\right\}_{n=1}^{\infty}} \lim_{n\to\infty} \frac{1}{n} I^{(n)}\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) = \sup_{\text{Bit-symmetric } \left\{P^{(n)}\right\}_{n=1}^{\infty}} \lim_{n\to\infty} \frac{1}{n} I^{(n)}\left(X_{1-\Delta_-}^{n+\Delta_+}; Y^n\right) \qquad (9)$$

The proof is given in Appendix C. The idea is similar to the proof of Proposition 4.

It is an open question whether the supremizing input set in Proposition 14 could further be reduced to the SE input set to match with Theorem 11, thereby allowing us to obtain a Shannon-theoretic result in parallel with Eq. (8). In the specific case of the DID channel, we shall answer this in the positive in the next section. Nevertheless, since the left-hand side of Eq. (9) is an upper bound on the capacity $C$ given in Theorem 11, the proposition helps in calculating this upper bound in general.

D. Applicability to the DID Channel 

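Reusing the notation $\mathcal{L}(Z_i | y_i, x_i, x_{i-1})$ in Section II-A, define

$$\mathcal{L}\left(x_0^n, \{Y^n = y^n\}\right) = \bigcap_{i=1}^{n} \mathcal{L}\left(Z_i | y_i, x_i, x_{i-1}\right)$$

Then the channel law, under the Shannon-theoretic setting, is specified by:

$$P_{Y^n | X_0^n}\left(y^n | x_0^n\right) = P_Z\left(\mathcal{L}\left(x_0^n, \{Y^n = y^n\}\right)\right)$$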

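We first verify that the DID channel is consistent. Noticing that $P_Z$ is consistent, one easily deduces that:

$$P_{Y^{n+1} | X_0^{n+1}}\left(y^{n+1} | x_0^{n+1}\right) = P_{Y^n | X_0^n}\left(y^n | x_0^n\right) P_{Z_{n+1} | Z^n}\left( \mathcal{L}\left(Z_{n+1} | y_{n+1}, x_{n+1}, x_n\right) \,\middle|\, \mathcal{L}\left(x_0^n, \{Y^n = y^n\}\right) \right)$$

But for any $y^n$ and $x_0^{n+1}$,

$$\sum_{y_{n+1} \in \mathrm{GF}(2)} P_{Z_{n+1} | Z^n}\left( \mathcal{L}\left(Z_{n+1} | y_{n+1}, x_{n+1}, x_n\right) \,\middle|\, \mathcal{L}\left(x_0^n, \{Y^n = y^n\}\right) \right) = 1$$

and therefore

$$P_{Z_{n+1} | Z^n}\left( \mathcal{L}\left(Z_{n+1} | \cdot, x_{n+1}, x_n\right) \,\middle|\, \mathcal{L}\left(x_0^n, \{Y^n = y^n\}\right) \right)$$

is equivalent to a probability measure $\Phi_{n+1}$ on $\mathcal{Y}_{n+1}$. By Lemma 8, the DID channel is consistent.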

As in Section II-A, under the initialization $P(Z_1 = 0) = p_d/(p_i + p_d)$ and $P(Z_1 = 1) = p_i/(p_i + p_d)$, we have:

$$\begin{aligned}
P_{Y^{n+k} | X_0^{n+k}}\left(\{Y_{k+1}^{n+k} = y\} \,|\, x_0^{n+k}\right)
&= P_Z\left( \bigcap_{i=1}^{n} \mathcal{L}\left(Z_{k+i} | y_i, x_{k+i}, x_{k+i-1}\right) \right) \\
&= P_Z\left( \bigcap_{i=1}^{n} \mathcal{L}\left(Z_{i} | y_i, x_{k+i}, x_{k+i-1}\right) \right) \\
&= P_{Y^n | X_0^n}\left(y | x_k^{n+k}\right)
\end{aligned}$$

which establishes the DID channel's stationarity. It is also easy to see that the DID channel is bit-symmetric in the Shannon-theoretic sense. Finally, we recall that the channel law $P_{Y | X_0^\infty}(\cdot | x)$ exists for the DID channel under the ergodic-theoretic framework. As such, for consistent inputs, the joint input-output distribution of this model can always be described by Eq. (3). In Section II-A, we have shown that for the DID channel model, under the aforementioned initialization, the channel law $P_{Y | X_0^\infty}(\cdot | x)$ is ergodic (in the ergodic-theoretic sense), i.e. an SE input induces an ergodic joint input-output distribution. By our Shannon-theoretic definition of ergodic channels, the DID channel's ergodicity is thus validated.
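The three finite-dimensional properties just verified (consistency, stationarity and bit-symmetry) can be checked by brute force for short blocks. The Python sketch below is an illustration only: it assumes the DID output rule $Y_i = X_{i-Z_i}$ from outside this section, uses the stationary initialization of $Z_1$, and all function names are hypothetical.

```python
from itertools import product


def state_prob(z, p_i, p_d):
    """P_Z(Z_1..Z_n = z) under the stationary initialization."""
    pi = (p_d / (p_i + p_d), p_i / (p_i + p_d))
    trans = ((1 - p_i, p_i), (p_d, 1 - p_d))   # trans[a][b] = P(b | a)
    p = pi[z[0]]
    for a, b in zip(z, z[1:]):
        p *= trans[a][b]
    return p


def channel_law(y, x, p_i, p_d):
    """P(Y^n = y | X_0^n = x), with len(x) == len(y) + 1.

    Sums P_Z over all state sequences z consistent with (x, y),
    i.e. those with y_i = x_{i - z_i} for every i.
    """
    n = len(y)
    total = 0.0
    for z in product((0, 1), repeat=n):
        if all(x[i - z[i - 1]] == y[i - 1] for i in range(1, n + 1)):
            total += state_prob(z, p_i, p_d)
    return total


def flip(v):
    """Bitwise complement of a binary tuple."""
    return tuple(1 - b for b in v)
```

Consistency corresponds to summing `channel_law` over the last output symbol, stationarity to summing over the first output symbol after prepending one input symbol, and bit-symmetry to comparing `channel_law(y, x, ...)` with `channel_law(flip(y), flip(x), ...)`.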

Now we argue that initializations are irrelevant to the DID channel capacity under the Shannon-theoretic framework. The following capacity formula for indecomposable finite-state channels (FSC), which are those with the initial FSC state's effect vanishing with time, is well-known [3]:

$$C = \lim_{n\to\infty} \frac{1}{n} \sup_{X^n} I\left(X^n; Y^n\right) \qquad (10)$$

It can be proven that the DID channel is an indecomposable FSC for any $p_i, p_d \in (0,1)$ by an easy extension of the argument of the case $p_d = 1$ presented in [9, Proposition 11]. The initial FSC state is $(X_0, Z_0)$ as in that argument, where we extend the DID channel state sequence to $Z_0$ without affecting $Z_1, Z_2, \ldots$, which is possible for Markov processes. Since the DID channel is indecomposable, its capacity is the same for all distributions on $(X_0, Z_0)$. Hence for given $p_i$ and $p_d$, it can be calculated w.r.t. $P(Z_0 = 0) = p_d/(p_i + p_d)$ and $P(Z_0 = 1) = p_i/(p_i + p_d)$. But this distribution on $Z_0$ induces the aforementioned distribution on $Z_1$, which concludes our argument.

Finally the observation that the ergodic-theoretic channel definition also applies to the DID channel model leads to the following result:

$$C = \sup_{\text{Stationary, bit-symmetric } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y^n\right)$$

To see this, we first note that since the channel law $P_{Y | X_0^\infty}(\cdot | x)$ exists and is shown to define an ergodic-theoretic stationary (one-sided) channel, and the joint input-output distribution obeys Eq. (3), applying [5, Lemma 12.4.2], we have:

$$\sup_{\text{Stationary } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y^n\right) = \sup_{\text{SE } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y^n\right)$$

The formula for $C$ then follows immediately from Proposition 14 and Theorem 11.

Fig. 2. The two terms in Eq. (7) for the i.u.d. input.

The formula we obtain is hence the same under both frameworks. In subsequent sections, we shall explore various ways to evaluate the right-hand side of Eq. (8), thereby yielding the same results for both frameworks.

IV. DID Channel Capacity Lower Bound 

To derive a good lower bound, behaviors of the two terms in Eq. (7) have to be analyzed. Fig. 2 plots the terms for the case of i.u.d. input, using the method in [21], when $p_i = p_d = p_{id}$. It can be observed that $\lim_{n\to\infty} \frac{1}{n} H(Y^n)$ decreases slowly for low $p_{id}$ (e.g. $p_{id} < 0.5$), whereas $\lim_{n\to\infty} \frac{1}{n} H(Y^n | X_0^n)$ varies drastically. As such, if we are to approximate the capacity for practical values of $p_{id}$, the term $\lim_{n\to\infty} \frac{1}{n} H(Y^n | X_0^n)$ should be carefully handled, and an estimation of $\lim_{n\to\infty} \frac{1}{n} H(Y^n)$ might be sufficient.

To gain more insights into how the terms could be evaluated, from Eq. (8), notice the following correspondence:

$$\lim_{n\to\infty} \frac{1}{n} H(Y^n) = \lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right)$$

$$\lim_{n\to\infty} \frac{1}{n} H(Y^n | X_0^n) = \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right)$$

Since the input is i.u.d., it can be shown that $H(Y_n) = 1$ for any $p_{id}$. With the fact that the first term is close to 1 at low $p_{id}$ whereas the second term differs, one may say, only a few most recent past outputs in $Y^{n-1}$ carry a majority of knowledge about $Y_n$ for low $p_{id}$; however, when given $X_0^n$, farther past outputs are able to resolve more uncertainty about $Y_n$ and thus may not be ignored. While this discussion pertains to the i.u.d. input only, the (only) capacity-achieving input is i.u.d. when $p_{id} = 0$ (i.e. the channel is noiseless) and so by a continuity argument, at low $p_{id}$, the inputs that achieve the capacity should behave nearly i.u.d.-like and the insights drawn are thus expected to be useful.

$$H\left(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}\right) = \frac{\alpha p_d}{p_i + p_d} h_2\left(\alpha - \alpha p_i\right) + \frac{p_i (1 - \alpha p_d)}{p_i + p_d} h_2\left(A_1\right) + \frac{p_d (1 + \alpha p_i - \alpha)}{p_i + p_d} h_2\left(A_2\right)$$

$$A_1 = \frac{1 - \alpha - 2\alpha p_d + \alpha p_d^2 + 3\alpha^2 p_d - 2\alpha^2 p_d^2 + \alpha p_i p_d - \alpha^2 p_i p_d}{1 - \alpha p_d}$$

$$A_2 = \frac{1 + 2\alpha p_i - 2\alpha - \alpha p_i^2 - 2\alpha^2 p_i + \alpha^2 + \alpha^2 p_i^2 - \alpha p_d p_i + 2\alpha^2 p_i p_d}{1 + \alpha p_i - \alpha} \qquad (12)$$

$$\lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right) = \sum_{k=1}^{\infty} \alpha^2 (1-\alpha)^{k-1} \left[ \frac{p_d}{p_i + p_d} h_2\left( \frac{p_d + p_i (1 - p_i - p_d)^k}{p_i + p_d} \right) + \frac{p_i}{p_i + p_d} h_2\left( \frac{p_i + p_d (1 - p_i - p_d)^k}{p_i + p_d} \right) \right] \qquad (13)$$

Recall that the lower bound established in [8], [9] is the achievable rate with i.u.d. input, $C_{iud}$. Now to derive an analytical lower bound that improves on $C_{iud}$, not only do we need to consider a more complex input distribution, but it also cannot be too complicated to analyze. An immediate candidate is, a fortiori, a stationary bit-symmetric first-order Markovian input: $P_{X_{n+1}|X_n}(1|0) = P_{X_{n+1}|X_n}(0|1) = \alpha$ and $P_{X_0}(0) = P_{X_0}(1) = 0.5$. We evaluate each term in Eq. (8) in the following.


A. The first term 

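As discussed, an estimation of this term that retains the first few recent past outputs is sufficient. We lower-bound the term with $Y^{n-2}$ discarded:

$$\begin{aligned}
\lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right) &\ge \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^{n-2}, Z^{n-2}\right) \\
&\overset{(a)}{=} \lim_{n\to\infty} H\left(Y_n | Y_{n-1}, X_0^{n-2}, Z^{n-2}\right) \\
&\overset{(b)}{=} \lim_{n\to\infty} H\left(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}\right) \\
&\overset{(c)}{=} H\left(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}\right) \qquad (11)
\end{aligned}$$

where (a) is because $Y_i$ is a function of $Z_i$ and $X_{i-1}^{i}$, (b) is because

$$\left(Y_{n-1}, Y_n\right) \leftrightarrow \left(X_{n-2}, Z_{n-2}\right) \leftrightarrow \left(X_0^{n-3}, Z^{n-3}\right)$$

i.e. they form a Markov chain in that order, and (c) is due to the following:

$$\begin{aligned}
P_{Y_n Y_{n-1} X_{n-2} Z_{n-2}}\left(y_1, y_2, x, z\right)
&= \sum_{\substack{x_1, x_2 \\ z_1, z_2}} P_{Y_n | X_n X_{n-1} Z_n}\left(y_1 | x_1, x_2, z_1\right) P_{Y_{n-1} | X_{n-1} X_{n-2} Z_{n-1}}\left(y_2 | x_2, x, z_2\right) P_{Z_{n-2}^{n}}\left(z_1, z_2, z\right) P_{X_{n-2}^{n}}\left(x_1, x_2, x\right) \\
&= \sum_{\substack{x_1, x_2 \\ z_1, z_2}} P_{Y_{n+1} | X_{n+1} X_{n} Z_{n+1}}\left(y_1 | x_1, x_2, z_1\right) P_{Y_{n} | X_{n} X_{n-1} Z_{n}}\left(y_2 | x_2, x, z_2\right) P_{Z_{n-1}^{n+1}}\left(z_1, z_2, z\right) P_{X_{n-1}^{n+1}}\left(x_1, x_2, x\right) \\
&= P_{Y_{n+1} Y_{n} X_{n-1} Z_{n-1}}\left(y_1, y_2, x, z\right)
\end{aligned}$$

for any $(y_1, y_2, x, z)$ and $n \ge 3$, where we make use of the fact that $Y_n$ is a time-invariant function of $(X_n, X_{n-1}, Z_n)$ for any $n$, the considered input is stationary, and the capacity is computed w.r.t. stationary $\{Z_i\}$ as mentioned in Section II-A. $H\left(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}\right)$ is then given by Eq. (12), given that $Z_{n-2}$ attains its stationary distribution as stated.

For a better approximation, repeating the same argument above, we have:

$$\lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right) \ge H\left(Y_n | Y_{n-k}^{n-1}, X_{n-k-1}, Z_{n-k-1}\right)$$

for some finite $k \ge 1$. However at relatively low noise, $k = 1$ suffices.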


B. The second term 

Lemma 15. $\lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right)$ is given by Eq. (13), in which $\alpha = P(X_i \ne X_{i-1})$, for any stationary input.

Proof: $\alpha$ exists thanks to stationarity of the input. We make a few observations:

(Ob.1) Given $X_i \ne X_{i-1}$, knowledge (or respectively, uncertainty) about $Y_i$ is equivalent to knowledge (or respectively, uncertainty) about $Z_i$.

(Ob.2) Given $X_i = X_{i-1}$, uncertainty about $Y_i$ is completely resolved, and knowledge of $Y_i$ provides no information about $Z_i$.

(Ob.3) Without knowing $Y_i$, knowledge of $X_i$ also provides no information about $Z_i$, since the input $X_0^n$ and the channel state $Z^n$ are independent.

(Ob.4) As for resolving uncertainty about $Z_i$, knowledge of $Y^{i-1}$ and $X_0^{i-1}$ is only as good as to provide (partial) knowledge of $Z^{i-1}$. Furthermore, since $Z^n$ is a first-order Markov process, in the resolution of uncertainty about $Z_i$, for $q \le k \le i-1$, knowing $Z_k$ (without knowledge of $Z_{k+1}^{i-1}$, if $k < i-1$) is the same as knowing $Z_q^k$.
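$$\begin{bmatrix} P_{Z_n | Z_{n-k}}(0|0) & P_{Z_n | Z_{n-k}}(0|1) \\ P_{Z_n | Z_{n-k}}(1|0) & P_{Z_n | Z_{n-k}}(1|1) \end{bmatrix} = \frac{1}{p_i + p_d} \begin{bmatrix} p_d + p_i (1 - p_i - p_d)^k & p_d - p_d (1 - p_i - p_d)^k \\ p_i - p_i (1 - p_i - p_d)^k & p_i + p_d (1 - p_i - p_d)^k \end{bmatrix} \qquad (14)$$

$$H\left(Z_n | Z_{n-k}\right) = \frac{p_d}{p_i + p_d} h_2\left( \frac{p_d + p_i (1 - p_i - p_d)^k}{p_i + p_d} \right) + \frac{p_i}{p_i + p_d} h_2\left( \frac{p_i + p_d (1 - p_i - p_d)^k}{p_i + p_d} \right) \qquad (15)$$

$$C^{lb}_{Mkv} = \max_{\alpha \in [0,1]} \left[ H\left(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}\right) - \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right) \right]_{X_0^\infty \sim \mathrm{Markov}(\alpha)} \qquad (16)$$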


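We then have:

$$\begin{aligned}
H\left(Y_n | Y^{n-1}, X_0^n\right)
&= H\left(Y_n | Y^{n-1}, X_0^n, X_n = X_{n-1}\right) P\left(X_n = X_{n-1}\right) + H\left(Y_n | Y^{n-1}, X_0^n, X_n \ne X_{n-1}\right) P\left(X_n \ne X_{n-1}\right) \\
&= \alpha H\left(Z_n | Y^{n-1}, X_0^n, X_n \ne X_{n-1}\right) \\
&= \alpha H\left(Z_n | Y^{n-1}, X_0^{n-1}\right)
\end{aligned}$$

and

$$\begin{aligned}
H\left(Z_n | Y^{n-1}, X_0^{n-1}\right)
&= \alpha H\left(Z_n | Y^{n-1}, X_0^{n-1}, X_{n-1} \ne X_{n-2}\right) + (1-\alpha) H\left(Z_n | Y^{n-1}, X_0^{n-1}, X_{n-1} = X_{n-2}\right) \\
&= \alpha H\left(Z_n | Z_{n-1}\right) + (1-\alpha) H\left(Z_n | Y^{n-2}, X_0^{n-2}\right)
\end{aligned}$$

Similarly:

$$\begin{aligned}
H\left(Z_n | Y^{n-2}, X_0^{n-2}\right) &= \alpha H\left(Z_n | Z_{n-2}\right) + (1-\alpha) H\left(Z_n | Y^{n-3}, X_0^{n-3}\right) \\
&\ \ \vdots \\
H\left(Z_n | Y_1, X_0^1\right) &= \alpha H\left(Z_n | Z_1\right) + (1-\alpha) H\left(Z_n\right)
\end{aligned}$$

We are then left with evaluating $H(Z_n | Z_{n-k})$ for $k = 1, \ldots, n-1$. Consider

$$\begin{bmatrix} P_{Z_n | Z_{n-1}}(0|0) & P_{Z_n | Z_{n-1}}(0|1) \\ P_{Z_n | Z_{n-1}}(1|0) & P_{Z_n | Z_{n-1}}(1|1) \end{bmatrix} = \begin{bmatrix} 1 - p_i & p_d \\ p_i & 1 - p_d \end{bmatrix}$$

We then obtain Eq. (14). Subsequently $H(Z_n | Z_{n-k})$ is given by Eq. (15). Putting everything together, we obtain the lemma.

■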
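The closed form in Eq. (14) and the resulting entropy in Eq. (15) can be cross-checked against direct matrix powers. The Python sketch below is a sanity check rather than part of the proof; the function names are hypothetical, and the matrices follow the column convention above (entry $[b][a]$ is $P(Z_n = b | Z_{n-k} = a)$).

```python
from math import log2


def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)


def k_step_closed_form(p_i, p_d, k):
    """Eq. (14): entry [b][a] is P(Z_n = b | Z_{n-k} = a)."""
    s, lam = p_i + p_d, (1 - p_i - p_d) ** k
    return [[(p_d + p_i * lam) / s, (p_d - p_d * lam) / s],
            [(p_i - p_i * lam) / s, (p_i + p_d * lam) / s]]


def k_step_by_power(p_i, p_d, k):
    """k-th power of the one-step (column-stochastic) transition matrix."""
    one = [[1 - p_i, p_d], [p_i, 1 - p_d]]
    m = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        m = [[sum(one[b][c] * m[c][a] for c in range(2)) for a in range(2)]
             for b in range(2)]
    return m


def state_cond_entropy(p_i, p_d, k):
    """Eq. (15): H(Z_n | Z_{n-k}) under the stationary distribution."""
    s = p_i + p_d
    t = k_step_closed_form(p_i, p_d, k)
    return (p_d / s) * h2(t[0][0]) + (p_i / s) * h2(t[1][1])
```

As $k$ grows, the $k$-step matrix converges to the stationary distribution in each column, so `state_cond_entropy` approaches $H(Z_n)$.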

When the input is the aforementioned first-order Markovian process, $\alpha$ given in Eq. (13) is also $P_{X_{n+1}|X_n}(1|0) = P_{X_{n+1}|X_n}(0|1)$.

Now combining Eq. (12) and (13), the new lower bound $C^{lb}_{Mkv}$ is then given by Eq. (16). The maximization is reduced to a univariate one (i.e. maximization in $\alpha$), which can be computed efficiently. The results are given in Fig. 3 and 4. It can be observed that $C^{lb}_{Mkv}$ improves over $C_{iud}$, given in [8], for most values of $p_{id}$. We also note that $C^{lb}_{Mkv}$ evaluated at the unoptimized $\alpha = 0.5$ is close to $C_{iud}$, which shows that the estimation in Eq. (11) is not a bad one, as previously expected.
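The univariate maximization in Eq. (16) can be sketched as a simple grid search. The Python code below is a hedged illustration, not the authors' program: it evaluates Eq. (12) and a truncated version of the series in Eq. (13) as reconstructed in this section, assumes $p_i, p_d \in (0,1)$, and uses hypothetical function names, grid size and truncation rule.

```python
from math import log2


def h2(p):
    """Binary entropy in bits; the argument is clipped to [0, 1]."""
    p = min(max(p, 0.0), 1.0)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)


def first_term(a, p_i, p_d):
    """Eq. (12): H(Y_n | Y_{n-1}, X_{n-2}, Z_{n-2}) for Markov input a."""
    s = p_i + p_d
    a1 = (1 - a - 2*a*p_d + a*p_d**2 + 3*a**2*p_d - 2*a**2*p_d**2
          + a*p_i*p_d - a**2*p_i*p_d) / (1 - a*p_d)
    a2 = (1 + 2*a*p_i - 2*a - a*p_i**2 - 2*a**2*p_i + a**2 + a**2*p_i**2
          - a*p_d*p_i + 2*a**2*p_i*p_d) / (1 + a*p_i - a)
    return ((a * p_d / s) * h2(a - a * p_i)
            + (p_i * (1 - a * p_d) / s) * h2(a1)
            + (p_d * (1 + a * p_i - a) / s) * h2(a2))


def second_term(a, p_i, p_d, kmax=5000, tol=1e-15):
    """Eq. (13), truncated once the geometric weight drops below tol."""
    s, lam = p_i + p_d, 1 - p_i - p_d
    total, weight, lam_k = 0.0, a * a, lam
    for _ in range(kmax):
        if weight < tol:
            break
        total += weight * ((p_d / s) * h2((p_d + p_i * lam_k) / s)
                           + (p_i / s) * h2((p_i + p_d * lam_k) / s))
        weight *= 1 - a
        lam_k *= lam
    return total


def lower_bound(p_i, p_d, grid=201):
    """Grid search for Eq. (16); returns (bound, maximizing alpha)."""
    alphas = (j / (grid - 1) for j in range(grid))
    return max((first_term(a, p_i, p_d) - second_term(a, p_i, p_d), a)
               for a in alphas)
```

A finer grid or a one-dimensional optimizer could replace the grid search; the grid is used here only to keep the sketch dependency-free.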



Fig. 3. Optimized $\alpha$ for $C^{lb}_{Mkv}$.

Fig. 4. New lower bound $C^{lb}_{Mkv}$. Also shown is $C^{lb}_{Mkv}$ with unoptimized $\alpha = 0.5$ (i.e. i.u.d. input) for all $p_{id}$.


The bounding technique here relies on the exact establishment of the second entropy term. While Lemma 15 is heavily channel-dependent, we make a short note on how the strategy could be extended to calculate the entropy term when $Z_i$ has a larger alphabet. For example, consider $\mathcal{Z} = \{0, 1, 2\}$, and the term is then $\lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_{-1}^{n}\right)$. Consider the following 4 variables:

$$\begin{aligned}
\alpha_1 &= P\left(X_n = X_{n-1} = X_{n-2}\right) \\
\alpha_2 &= P\left(X_n = X_{n-1} \ne X_{n-2}\right) \\
\alpha_3 &= P\left(X_n \ne X_{n-1} = X_{n-2}\right) \\
\alpha_4 &= P\left(X_{n-1} \ne X_n = X_{n-2}\right)
\end{aligned}$$

$\{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$ plays the role of $\alpha$ in Lemma 15 and can be reduced in size via the stationarity condition and also the fact $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$. The key to the extension is to realize that similar to (Ob.1) and (Ob.2), given an event that corresponds to one among $\alpha_1, \alpha_2, \alpha_3, \alpha_4$, the uncertainty of $Y_i$ could be partially reduced to that of $Z_i$, and knowledge of $Y_i$ is partially equivalent to that of $Z_i$. For example,

$$H\left(Y_n | Y^{n-1}, X_{-1}^{n}, X_n \ne X_{n-1} = X_{n-2}\right) = H\left(Z_n^{(12)} | Y^{n-1}, X_{-1}^{n}, X_n \ne X_{n-1} = X_{n-2}\right)$$

$$H\left(Z_n^{(12)} | Y^{n-1}, X_{-1}^{n-1}, X_{n-2} \ne X_{n-1} = X_{n-3}\right) = H\left(Z_n^{(12)} | Z_{n-1}^{(02)}, Y^{n-2}, X_{-1}^{n-1}, X_{n-2} \ne X_{n-1} = X_{n-3}\right)$$

where $Z_n^{(jk)} = \mathbb{I}\left(Z_n \in \{j, k\}\right)$. Here $\mathbb{I}(A)$ is 1 if $A$ is true and 0 otherwise.


V. DID Channel Capacity Upper Bound

In this section, we formulate a new series of computable upper bounds with a parameter $\ell$. The upper bounds improve as $\ell$ increases, and match up with the developed lower bound at low noise levels. We also discuss the crucial role of the input stationarity condition in the new upper bounds, without which the bounds could be trivialized.


A. Formulation and Computation

An upper bound is usually difficult to establish analytically, since it involves maximization over all input distributions, unlike lower bounds. We shall rely on computational methods. To do so, we assume stationary inputs and bound each term in Eq. (8) as follows. Let $\ell \ge 1$ be a given finite parameter to control the accuracy of the to-be-formulated upper bound. For any $n \ge 1$, we have:

$$H\left(Y_{n+\ell} | Y^{n+\ell-1}\right) \le H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right)$$

Note that the right-hand side is independent of $n$ by stationarity. As such,

$$\lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right) \le H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right) \qquad (17)$$

Next, notice that in the resolution of uncertainty in $Y_i$, given $X_{i-1}^{i}$, other random variables can help at best by providing information about $Z_i$. Together with observation (Ob.4) made earlier, we have the Markov chain

$$\left(Y^{n}, X_0^{n-1}\right) \to \left(Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n\right) \to Y_{n+\ell}$$

Therefore:

$$\begin{aligned}
H\left(Y_{n+\ell} | Y^{n+\ell-1}, X_0^{n+\ell}\right) &\ge H\left(Y_{n+\ell} | Y^{n+\ell-1}, X_0^{n+\ell}, Z_n\right) \\
&= H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n\right)
\end{aligned}$$

whose right-hand side is independent of $n$ for the same reason that leads to Eq. (11). Then:

$$\lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right) \ge H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n\right) \qquad (18)$$

By Eq. (17) and (18), letting

$$C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right) = H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right) - H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n\right)$$

we have

$$C \le C^{ub}_{\ell} = \sup_{\text{Stationary, bit-symmetric } P_{X_n^{n+\ell}}} C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$$

Then $C^{ub}_{\ell}$ gives a capacity upper bound, controlled by $\ell$.

To compute this bound, we turn to the following lemma, which is a straightforward exercise.

Lemma 16. Given vectors $w_i, q_i \in \mathbb{R}^n$ and variable $u \in \mathbb{R}^n$, define a function $f(u) = -\sum_i \left(w_i^T u\right) \log \frac{w_i^T u}{q_i^T u}$ where $\mathrm{dom}(f) = \{u : w_i^T u > 0\ \forall i\}$. Then $f$ is a concave function.

It is easy to see that $H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n\right)$ is affine in $P_{X_n^{n+\ell}}$ and $H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right)$ takes the form of function $f$ in the above lemma with $P_{X_n^{n+\ell}}$ being the variable $u$. Then $C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$ is a concave function. We note the following:

• $P_{X_n^{n+\ell}}$ being a valid probability measure means $\sum_{x} P_{X_n^{n+\ell}}(x) = 1$ (unity-sum condition) and $P_{X_n^{n+\ell}}(x) \ge 0$ (non-negativity condition) $\forall x \in \mathrm{GF}(2)^{\ell+1}$.

• The stationarity condition is linear and hence can be converted into the form $S p_X = 0$, where $p_X = \left[\ \cdots\ P_{X_n^{n+\ell}}(x)\ \cdots\ \right]^T$ and $S$ is a matrix. For example, when $\ell = 1$, we have:

$$p_X = \begin{bmatrix} P_{X_n^{n+1}}(0,0) \\ P_{X_n^{n+1}}(0,1) \\ P_{X_n^{n+1}}(1,0) \\ P_{X_n^{n+1}}(1,1) \end{bmatrix}, \qquad S = \begin{bmatrix} 0 & 1 & -1 & 0 \end{bmatrix}$$

which represents $P_{X_n}(0) = P_{X_{n+1}}(0)$ and $P_{X_n}(1) = P_{X_{n+1}}(1)$. In general, $S$ can be constructed efficiently by computations. The following lemma indicates that the number of rows in $S$ is at most $2^{\ell} - 1$, i.e. it grows linearly with the size of $p_X$ and so using the stationarity condition in computations is not too costly.


Lemma 17. For any finite $n$ and random vector $U^n \in \mathrm{GF}(2)^n$ with probability distribution $P$, it is stationary if and only if for any $v \in \mathrm{GF}(2)^{n-1}$,

$$P\left(U_1^{n-1} = v\right) = P\left(U_2^{n} = v\right) \qquad (19)$$

In fact, if Eq. (19) is satisfied for any $2^{n-1} - 1$ vectors $v$ from $\mathrm{GF}(2)^{n-1}$, it is also satisfied for the other $v \in \mathrm{GF}(2)^{n-1}$.
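Proof: Let us consider the first claim. The forward direction is immediate. We prove the converse. Since $U^n$ is defined over $\mathrm{GF}(2)$, any event on it implies that a subset of entries in $\{U_i : i = 1, \ldots, n\}$ takes up a specific value. Consider an arbitrary event $E = \{U_i = a_i,\ i \in N_E\}$ for any $N_E \subseteq \{1, \ldots, n-1\}$. Then if for any such event, $P(E) = P(T^{-s}E)$ for $s = 1, \ldots, n - \max N_E$, $P$ is stationary. Assume that for any $v \in \mathrm{GF}(2)^{n-1}$, $P\left(U_1^{n-1} = v\right) = P\left(U_2^{n} = v\right)$. We have:

$$\begin{aligned}
P\left(T^{-s}E\right)
&= \sum_{\substack{v_{i+s-1} = a_i,\ i \in N_E \\ v_{j+s-1} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_2^{n} = v\right)
= \sum_{\substack{v_{i+s-1} = a_i,\ i \in N_E \\ v_{j+s-1} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_1^{n-1} = v\right) \\
&= \sum_{\substack{v_{i+s-2} = a_i,\ i \in N_E \\ v_{j+s-2} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_2^{n} = v\right)
= \sum_{\substack{v_{i+s-2} = a_i,\ i \in N_E \\ v_{j+s-2} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_1^{n-1} = v\right) \\
&\ \ \vdots \\
&= \sum_{\substack{v_{i} = a_i,\ i \in N_E \\ v_{j} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_2^{n} = v\right)
= \sum_{\substack{v_{i} = a_i,\ i \in N_E \\ v_{j} \in \mathrm{GF}(2),\ j \notin N_E}} P\left(U_1^{n-1} = v\right) \\
&= P(E)
\end{aligned}$$

The first claim is hence proven.

To see the second claim, notice Eq. (19) can be written as

$$P_{U^n}\left([v, 0]\right) + P_{U^n}\left([v, 1]\right) = P_{U^n}\left([0, v]\right) + P_{U^n}\left([1, v]\right)$$

Summing both sides over all $v \in \mathrm{GF}(2)^{n-1}$, we then obtain a trivial equation, which implies one redundant equation in the system. ■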

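• The condition of bit-symmetry is also equivalent to a set of linear equations $P_{X_n^{n+\ell}}(x) = P_{X_n^{n+\ell}}(\neg x)$ $\forall x \in \mathrm{GF}(2)^{\ell+1}$. In fact, this condition is not even needed; one can easily establish a result for $C^{ub}_{\ell}$ similar to Proposition 4 using the same technique, i.e.

$$\sup_{\text{Stationary } P_{X_n^{n+\ell}}} C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right) = \sup_{\substack{\text{Bit-symmetric,} \\ \text{stationary } P_{X_n^{n+\ell}}}} C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$$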
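The construction of the stationarity matrix $S$ described in the list above can be sketched as follows. This Python snippet is an illustrative assumption about one way to build $S$ (lexicographic ordering of the $(\ell+1)$-bit strings, one row per $\ell$-bit pattern $v$ as in Eq. (19)); the function names are hypothetical. It keeps all $2^\ell$ rows and then confirms by exact elimination that one of them is redundant.

```python
from fractions import Fraction
from itertools import product


def stationarity_matrix(ell):
    """Rows indexed by v in GF(2)^ell, columns by x in GF(2)^(ell+1).

    Row v encodes P(X_n^{n+ell-1} = v) - P(X_{n+1}^{n+ell} = v) = 0.
    """
    cols = list(product((0, 1), repeat=ell + 1))
    return [[int(x[:-1] == v) - int(x[1:] == v) for x in cols]
            for v in product((0, 1), repeat=ell)]


def rank(matrix):
    """Rank via exact Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(e) for e in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r
```

For $\ell = 1$ the first row reproduces $S = [\,0\ \ 1\ \ {-1}\ \ 0\,]$ from the example, and dropping any single row of the full matrix leaves an equivalent system, in line with Lemma 17.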

With those points made above, we conclude that finding $C^{ub}_{\ell}$ is a convex optimization problem, which can be efficiently solved using various computational methods [22]. The results are shown in Fig. 5 for $\ell = 2, 3, \ldots, 7$. It can be observed that the series of upper bounds improves over the genie-erasure upper bound, given in [8], for most values of $p_{id}$ and closely approaches the new lower bound at low $p_{id}$.

The complexity of the program to find $C^{ub}_{\ell}$ scales rapidly with $\ell$; it increases polynomially with the number of variables $2^{\ell+1}$ and the number of equality constraints $2^{\ell}$ (including the stationarity condition and the unity-sum condition, excluding the bit-symmetry condition). Nevertheless Fig. 5 suggests that the upper bounds converge quickly with $\ell$, so small $\ell$ is sufficient to produce decent results.

Fig. 5. New upper bounds $C^{ub}_{\ell}$ for $\ell = 2, 3, \ldots, 7$. The higher curve corresponds to the lower $\ell$.

Fig. 6. New bounds for the case $p_d = 1$ in comparison with bounds derived in [9] (dashed curves).

In Fig. 6, we also compare our new bounds to those in [9], which were derived for the case $p_d = 1$. Improvements are observed.

As a note, Lemma 15 should not be used in place of Eq. (18); otherwise the maximization problem would be a non-convex one.

B. Tightness of the New Upper Bounds 

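It is easy to see that the bound is improved monotonically with increasing $\ell$. Indeed,

$$H\left(Y_{n+\ell+1} | Y_{n+1}^{n+\ell}\right) = H\left(Y_{n+\ell} | Y_{n}^{n+\ell-1}\right) \le H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right)$$

where the first equation is because of the channel output's stationarity, given that the channel and the input are stationary. We also have:

$$\begin{aligned}
H\left(Y_{n+\ell+1} | Y_{n+1}^{n+\ell}, X_{n}^{n+\ell+1}, Z_{n}\right)
&= H\left(Y_{n+\ell} | Y_{n}^{n+\ell-1}, X_{n-1}^{n+\ell}, Z_{n-1}\right) \\
&\ge H\left(Y_{n+\ell} | Y_{n}^{n+\ell-1}, X_{n-1}^{n+\ell}, Z_{n-1}^{n}\right) \\
&= H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_{n}^{n+\ell}, Z_{n}\right)
\end{aligned}$$

where the first and last equalities are by arguing in a similar manner to the derivation of Eq. (11) and (18) respectively. These show that

$$C^{ub}_{\ell+1}\left(P_{X_n^{n+\ell+1}}\right) \le C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$$

where $P_{X_n^{n+\ell}}$ is a marginal distribution of $P_{X_n^{n+\ell+1}}$. Taking the supremum, we obtain $C^{ub}_{\ell+1} \le C^{ub}_{\ell}$.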

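We next discuss the gap between $C$ and $C^{ub}_{\infty}$. Here $C^{ub}_{\infty}$ exists since $C^{ub}_{\ell}$ decreases with increasing $\ell$ and is bounded from below. We first connect $C$ and $C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$ by establishing the following:

$$C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right) \xrightarrow{\ \ell \to \infty\ } \left[ \lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right) - \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right) \right]_{X_0^\infty \sim P_{X_0^\infty}}$$

where the right-hand side is evaluated w.r.t. a stationary input distribution $P_{X_0^\infty}$ such that for every $\ell$, $P_{X_n^{n+\ell}}$ is a marginal distribution of $P_{X_0^\infty}$. Let $g\left(P_{X_0^\infty}\right)$ denote the right-hand side. Since $P_{X_0^\infty}$ is stationary, all $(\ell+1)$-dimensional marginals of $P_{X_0^\infty}$ are the same. We can then regard $C^{ub}_{\ell}(\cdot)$ as a function of $P_{X_0^\infty}$. With an abuse of notation, we write $C^{ub}_{\ell}\left(P_{X_0^\infty}\right)$ to mean $C^{ub}_{\ell}\left(P_{X_n^{n+\ell}}\right)$. It is easy to see that

$$\begin{aligned}
C^{ub}_{\ell}\left(P_{X_0^\infty}\right) = {} & H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right) + H\left(Z_n | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}\right) \\
& - H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}\right) - H\left(Z_n | Y_{n+1}^{n+\ell}, X_n^{n+\ell}\right)
\end{aligned}$$

Notice that

$$H\left(Z_n | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}\right) = H\left(Z_n | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell-1}\right)$$

$$H\left(Z_n | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell-1}\right) \ge H\left(Z_n | Y_{n+1}^{n+\ell}, X_n^{n+\ell}\right)$$

and therefore

$$\lim_{\ell\to\infty} H\left(Z_n | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell-1}\right) = \lim_{\ell\to\infty} H\left(Z_n | Y_{n+1}^{n+\ell}, X_n^{n+\ell}\right)$$

Furthermore, thanks to stationarity of the channel output and the joint input-output distribution,

$$\lim_{\ell\to\infty} H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}\right) = \lim_{n\to\infty} H\left(Y_n | Y^{n-1}\right)$$

$$\lim_{\ell\to\infty} H\left(Y_{n+\ell} | Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}\right) = \lim_{n\to\infty} H\left(Y_n | Y^{n-1}, X_0^n\right)$$

Therefore, $C^{ub}_{\ell}\left(P_{X_0^\infty}\right) \to g\left(P_{X_0^\infty}\right)$ as $\ell \to \infty$, for any stationary distribution $P_{X_0^\infty}$. Now notice that

$$C = \sup_{\text{Stationary } P_{X_0^\infty}} g\left(P_{X_0^\infty}\right) = \sup_{\text{Stationary } P_{X_0^\infty}} \lim_{\ell\to\infty} C^{ub}_{\ell}\left(P_{X_0^\infty}\right)$$

whereas

$$C^{ub}_{\infty} = \lim_{\ell\to\infty} \sup_{\text{Stationary } P_{X_0^\infty}} C^{ub}_{\ell}\left(P_{X_0^\infty}\right)$$

The gap hence lies in the order of the limit and the supremum. This is a consequence of computational feasibility: restriction to finite-dimensional distributions $P_{X_n^{n+\ell}}$ is required for computations, whereas $C$ is inherently a quantity with infinite block lengths. We conjecture that no other upper-bounding methods that directly modify the entropy terms in Eq. (8) (or the mutual information term in Eq. (6)) are better than the presented one. If the discussed gap is non-trivial and one seeks a better upper bound, a different formulation of the capacity might be called for.

In fact, in the Shannon-theoretic framework, it shall be proven that the gap is trivial, i.e. $C^{ub}_{\infty} = C$. We have:

$$\begin{aligned}
\ell\, C^{ub}_{\ell}\left(P_{X_0^\infty}\right)
&\overset{(a)}{\le} \sum_{j=1}^{\ell} C^{ub}_{j}\left(P_{X_0^\infty}\right) \\
&= \sum_{j=1}^{\ell} \left[ H\left(Y_{n+j} | Y_{n+1}^{n+j-1}\right) - H\left(Y_{n+j} | Y_{n+1}^{n+j-1}, X_n^{n+j}, Z_n\right) \right] \\
&= H\left(Y_{n+1}^{n+\ell}\right) - H\left(Y_{n+1}^{n+\ell} | X_n^{n+\ell}, Z_n\right) \\
&= I\left(X_n^{n+\ell}, Z_n; Y_{n+1}^{n+\ell}\right) = I\left(X_n^{n+\ell}; Y_{n+1}^{n+\ell}\right) + I\left(Z_n; Y_{n+1}^{n+\ell} | X_n^{n+\ell}\right)
\end{aligned}$$

and therefore

$$\begin{aligned}
C^{ub}_{\infty}
&\le \lim_{\ell\to\infty} \frac{1}{\ell} \sup_{\text{Stationary } P_{X_0^\infty}} I\left(X_n^{n+\ell}; Y_{n+1}^{n+\ell}\right) + \lim_{\ell\to\infty} \frac{1}{\ell} \sup_{\text{Stationary } P_{X_0^\infty}} I\left(Z_n; Y_{n+1}^{n+\ell} | X_n^{n+\ell}\right) \\
&\overset{(b)}{=} \lim_{\ell\to\infty} \frac{1}{\ell} \sup_{\text{Stationary } P_{X_0^\infty}} I\left(X_n^{n+\ell}; Y_{n+1}^{n+\ell}\right) \\
&\overset{(c)}{=} \lim_{n\to\infty} \frac{1}{n} \sup_{\text{Stationary } P_{X_0^\infty}} I\left(X_0^{n}; Y^{n}\right) \\
&\le \lim_{n\to\infty} \frac{1}{n} \sup_{X^n} I\left(X^{n}; Y^{n}\right) \\
&\overset{(d)}{=} C
\end{aligned}$$

Here (a) is because $C^{ub}_{\ell+1}\left(P_{X_0^\infty}\right) \le C^{ub}_{\ell}\left(P_{X_0^\infty}\right)$ as proven above, (b) is because

$$0 \le \frac{1}{\ell} I\left(Z_n; Y_{n+1}^{n+\ell} | X_n^{n+\ell}\right) \le \frac{1}{\ell} \log |\mathcal{Z}| = \frac{1}{\ell} \to 0$$

(c) is because of stationarity, and (d) is Eq. (10), stemming from the fact the DID channel is an indecomposable FSC in the Shannon-theoretic framework. Finally, since $C \le C^{ub}_{\infty}$, we have $C^{ub}_{\infty} = C$.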

Aside from increasing £, one may expect to obtain a better 
upper bound by adding more constraints on the input search 
space. Note, however, not all constraints would help. One 
example is bit-symmetry as pointed out earlier. An opposite 
example is the stationarity condition, which is discussed next. 
C. Importance of the Stationary Condition

We shed light on why the stationarity condition is crucial
to developing meaningful upper bounds. In particular, without
this condition, they could be trivialized, i.e. equal to 1. To this
end, let us consider the following quantity:

$$U_\ell = \sup_{P_{X_n^{n+\ell}}} C^{\mathrm{ub}}_\ell\left(P_{X_n^{n+\ell}}\right)$$

which is essentially $C^{\mathrm{ub}}_\ell$ without input stationarity. Notice that
when the input is not stationary, $C^{\mathrm{ub}}_\ell(P_{X_n^{n+\ell}})$ is implicitly a
function of $n$, denoted by $\gamma(n)$ to emphasize this dependency.
In a similar manner to the derivation of Eq. (11), it is easy to
see that if $P_{X_n^{n+\ell}} = P^*$ yields the supremum of $\gamma(n)$ then
$P_{X_{n+1}^{n+\ell+1}} = P^*$ yields the supremum of $\gamma(n+1)$. As such,
the supremum in $U_\ell$ is independent of $n$, i.e. $U_\ell$ is not a
function of $n$.

First, we need to justify that $U_\ell$ is a valid upper bound on $C$.
This is indeed the case since $U_\ell \ge C^{\mathrm{ub}}_\ell \ge C$. Alternatively
this can be shown without resorting to $C^{\mathrm{ub}}_\ell$ as follows. We
modify Eq. (17) and (18):

hm <iiminfij(r„+£|r„"+i^-i) 

n—^oo ' n—>-oo 11/ 

hm 

> hm sup H (r„+£ I 1 

n—>-oo 


which are valid without the restriction to stationary inputs. 
Then: 


r ihTiinfiT(y„+£|y„"+i''”')- 1 

hmsup’ir(y„+£|y”+i''”'.^r^.^n) f 

n—fOQ ) 

< sup lim inf 

< hm inf sup Uc 

where the suprema are over all valid inputs, and (a) is because 
Uc is independent of n. 

We now show that $U_\ell = 1$ in the case $p_i = p_d$ for any
$\ell \ge 2$. Consider a specific input distribution $P'_{X_n^{n+\ell}}$ such that

$$P\left(X_n^{n+\ell} = \ldots 1010100\right) = P\left(X_n^{n+\ell} = \ldots 0101011\right) = 0.5$$

i.e. only two input strings with identical first two bits and alternating
bits thereafter are assigned probability 0.5. It is easy to
see that $P'_{X_n^{n+\ell}}$ is not stationary. With probability 1, we have
$X_{n+\ell} = X_{n+\ell-1}$, which completely resolves uncertainty in
$Y_{n+\ell}$, and so $H(Y_{n+\ell} \mid Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n) = 0$. For any
$y \in \mathrm{GF}(2)^{\ell-1}$,

P (W = = y) P = y) 

= 0.5P = y\Xl+‘^-^ = ...1010) 

P (W = =y)P = y) 

= 0.5P = y\Xl+‘^-^ = ...0101) 


As a property of the DID channel, for any output sequence 
y G GF (2)^”^, there exists only one z G GF (2)^”^ 
such that the event = y|X^+'^“^ = ...1010} is 

equivalent to 


event {Yftf' ^ = y|X"+^ ^ = ...0101) is equivalent to 
= ^z). But we have 

p(^:+f-' = z)=p(z;)+f-i = ^z) 

since pi = pd. As a result, 

P (W = 0\Y:+f-^ =y)=p {Y^+C = = y) 

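which implies $Y_{n+\ell}$ is independent of $Y_{n+1}^{n+\ell-1}$ and
$P(Y_{n+\ell} = 0) = P(Y_{n+\ell} = 1)$. Then $H(Y_{n+\ell} \mid Y_{n+1}^{n+\ell-1}) = 1$,
and consequently $C^{\mathrm{ub}}_\ell(P'_{X_n^{n+\ell}}) = 1$. We thus have
$1 = C^{\mathrm{ub}}_\ell(P'_{X_n^{n+\ell}}) \le U_\ell \le 1$, in which the latter inequality is
the trivial upper bound 1. Therefore, $U_\ell = 1$ for any $\ell \ge 2$.
When $\ell = 1$, it is easy to construct the same $P'_{X_n^{n+\ell}}$ and
show that $C^{\mathrm{ub}}_\ell(P'_{X_n^{n+\ell}}) = 1$. Notice that $P'_{X_n^{n+\ell}}$ satisfies the
stationarity condition in this case. Then both $C^{\mathrm{ub}}_\ell$ and $U_\ell$ are
equal to 1. We see why $\ell \ge 2$ has been used to obtain good
upper bounds so far.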
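The non-stationarity of the two-string input used above can be checked directly. The following is a minimal sketch, assuming a finite window of $\ell + 1$ bits indexed from 0; the helper names `two_string_input` and `pair_marginal` are hypothetical and introduced only for illustration. It compares the distribution of an adjacent bit pair at the start of the window with that of the final pair.

```python
def two_string_input(ell):
    """Two equiprobable strings of ell + 1 bits: alternating bits, then a repeated final bit."""
    alternating = [t % 2 for t in range(ell)]
    first = tuple(alternating + [alternating[-1]])
    second = tuple(1 - b for b in first)
    return {first: 0.5, second: 0.5}


def pair_marginal(pmf, t):
    """Distribution of the adjacent pair (x[t], x[t+1]) under pmf."""
    out = {}
    for x, p in pmf.items():
        key = (x[t], x[t + 1])
        out[key] = out.get(key, 0.0) + p
    return out


pmf = two_string_input(6)        # the strings 0101011 and 1010100
early = pair_marginal(pmf, 0)    # mass only on the pairs 01 and 10
last = pair_marginal(pmf, 5)     # mass only on the pairs 11 and 00
is_stationary = early == last    # a stationary input would need equal pair marginals
```

Since the pair marginals differ between positions, the input cannot be stationary, while the final two bits agree with probability 1 as required by the argument.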

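In the general case where $p_i \ne p_d$, we wish to find
some similar distribution $P'_{X_n^{n+\ell}}$. We may choose one such
that $P(X_{n+\ell} \ne X_{n+\ell-1}) = 0$, i.e. input strings whose
$X_{n+\ell-1}^{n+\ell}$ is either 10 or 01 are not allowed, and hence
the term $H(Y_{n+\ell} \mid Y_{n+1}^{n+\ell-1}, X_n^{n+\ell}, Z_n)$ is eliminated. In
order that $H(Y_{n+\ell} \mid Y_{n+1}^{n+\ell-1}) = 1$, we seek to have

$$P\left(Y_{n+\ell} = 0 \mid Y_{n+1}^{n+\ell-1} = y\right) = P\left(Y_{n+\ell} = 1 \mid Y_{n+1}^{n+\ell-1} = y\right)$$

for any $y \in \mathrm{GF}(2)^{\ell-1}$. This is equivalent to finding $P'_{X_n^{n+\ell}}$
that satisfies

$$\sum_{x} P\left(X_{n+\ell-1}^{n+\ell} = 00, X_n^{n+\ell-2} = x\right) P\left(Y_{n+1}^{n+\ell-1} = y \mid X_{n+\ell-1} = 0, X_n^{n+\ell-2} = x\right) = \sum_{x} P\left(X_{n+\ell-1}^{n+\ell} = 11, X_n^{n+\ell-2} = x\right) P\left(Y_{n+1}^{n+\ell-1} = y \mid X_{n+\ell-1} = 1, X_n^{n+\ell-2} = x\right)$$

$\forall y \in \mathrm{GF}(2)^{\ell-1}$, which accounts for $2^{\ell-1}$ equations.
We further have $2^{\ell}$ equations for the condition
$P(X_{n+\ell} \ne X_{n+\ell-1}) = 0$, where half of the input
strings are assigned probability 0, and another equation for
$\sum_x P'_{X_n^{n+\ell}}(x) = 1$. Then we are left with a system of $2^{\ell+1}$
variables and $2^{\ell} + 2^{\ell-1} + 1$ equations in total. It is easy to
see that $2^{\ell+1} > 2^{\ell} + 2^{\ell-1} + 1$ for any $\ell > 1$, which
means we may be able to find at least one input distribution
$P'_{X_n^{n+\ell}}$ that results in $C^{\mathrm{ub}}_\ell(P'_{X_n^{n+\ell}}) = 1$, implying again that
$U_\ell = 1$. While this is not guaranteed due to the non-negativity
condition $P'_{X_n^{n+\ell}}(x) \ge 0$ $\forall x \in \mathrm{GF}(2)^{\ell+1}$, simulations show
that it could indeed be the case for $p_i > p_d$, and when $p_i < p_d$,
the upper bound is also worse (see Fig. 7).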
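The counting argument above is easy to tabulate. The sketch below (the function name `free_dimensions` is hypothetical) returns the number of variables minus the number of linear equations, which equals $2^{\ell-1} - 1$; it says nothing about the non-negativity condition, which still has to be checked separately.

```python
def free_dimensions(ell):
    """Unknowns minus linear equations in the system for the (ell + 1)-bit input pmf."""
    variables = 2 ** (ell + 1)                  # one probability per input string
    equations = 2 ** ell + 2 ** (ell - 1) + 1   # forbidden strings, output balance, normalization
    return variables - equations


slack = {ell: free_dimensions(ell) for ell in range(1, 7)}
```

For $\ell = 1$ the system is square, and for every $\ell > 1$ it is under-determined, in line with the inequality stated in the text.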

VI. DID Channel Capacity at Low Noise

Observe that our bounds visually match up with one another
and with the lower bound on the i.u.d. achievable rate
(from Fig. 4) nicely for $p_{id} < 0.1$. This strongly suggests that, for
this noise range, the computed values are essentially the true capacity.
In this section, we will be interested in a finite-lettered
characterization of the capacity at such low noise. Again, we
will restrict the analysis to the case $p_i = p_d = p_{id}$, in particular
$p_{id} < 0.5$. In principle, other cases can be resolved in a similar
fashion.

Fig. 7. Comparison of the upper bounds for different $p_i$ and $p_d$. Here $\ell = 4$.

Along this line of work, Kanoria and Montanari [23]
achieved the same goal for the deletion channel. Their approach
and ours share certain similarities: both find a lower
bound and an upper bound, and the lower bound is simply
the achievable rate with a specific input distribution, which
is the i.u.d. input^. The key difference lies in the upper-bounding
technique. In their analysis, first, an intermediate
input distribution class is shown to yield an “upper bound” on
the rate achieved with any input distribution^; second, the rate
achieved by this class is then expressed in terms of the channel
parameters. The choice of the intermediate input class is not
arbitrary. On one hand, properties of this class should offer
sufficient analytical advantages. On the other hand, the class
should be sufficiently broad so that it is possible to find an
upper-bounding relation with any input distribution. Indeed
the choice in [23] appears to be quite specific to the deletion
channel.

In our case, it is not obvious how to pinpoint such an
input class. Our problem hence calls for a different approach.
Suppose that $f(\mu)$ is a good upper bound on the rate achieved
by an input $\mu$. At each fixed $p_{id}$, we find the deviation
$\Delta f = f(\mu^*) - f(\mu_{iud})$, which results from the deviation of
$\mu^*$, one of the capacity-achieving inputs, from $\mu_{iud}$, the i.u.d.
input®, via the Taylor expansion theorem. Then the order of
magnitude of $\Delta f$ is evaluated w.r.t. $p_{id}$. To make this feasible,
an important observation is that since we know the i.u.d. input

^The journal version [7] of their work extends further the low-noise expansion
by considering a broader class of input distributions that encompasses
the i.u.d. input.

*In fact, in [23], the rate of any input is upper-bounded by the rate of this 
class plus a quantity. This quantity is almost input-independent and small at 
low-noise levels. 

®There can be many inputs that achieve the capacity. 



Fig. 8. Illustration of the low-noise calculation strategy. The upper and lower
solid curves are $f$ corresponding to a capacity-achieving input and the i.u.d.
input respectively. A different input deviation from the i.u.d. input would
yield a different curve, e.g. the dashed curve, which may not come close to
the lower solid curve as $p_{id} \to 0$.

is the only capacity-achieving input when $p_{id} = 0$, $\mu^* \to \mu_{iud}$
as $p_{id} \to 0$. The use of $\Delta f$ hence circumvents the need
for a specific intermediate input class. Note that $f$ is, strictly
speaking, a function of $\mu$ and $p_{id}$. However an expansion about
both $\mu_{iud}$ and $p_{id} = 0$ is expected to yield the trivial rate of
1, which is not useful. This explains why we need to treat $p_{id}$
as a fixed parameter in the expansion of $f$. Fig. 8 illustrates
our strategy pictorially.

As observed from Fig. 8, the two curves close their gap and
hence $\Delta f \to 0$ as $p_{id} \to 0$, which makes sense since the i.u.d.
input achieves capacity at $p_{id} = 0$. This is the basis for us to
evaluate the order of magnitude of $\Delta f$ in terms of $p_{id}$.

We first define the $O$ and $\Theta$ notations. Since $p_{id}$ is small,
we say

• a quantity $Q$ scales as $O(g(p_{id}))$, or $Q = O(g(p_{id}))$,
if $\lim_{p_{id} \to 0} (Q / g(p_{id}))$ is finite,

• a quantity $Q$ scales as $\Theta(g(p_{id}))$, or $Q = \Theta(g(p_{id}))$, if
$\lim_{p_{id} \to 0} (Q / g(p_{id}))$ is finite and not equal to 0,

• a quantity $Q$ is more significant than $O(g(p_{id}))$ if
$\lim_{p_{id} \to 0} (Q / g(p_{id}))$ is infinite,

should the limits exist.
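As a numerical illustration of these definitions (not part of the original analysis), consider the binary entropy function near $1/2$. Its Taylor expansion gives $1 - h_2(\tfrac{1}{2} + x) = \tfrac{2}{\ln 2} x^2 + O(x^4)$, so the gap $1 - h_2(\tfrac{1}{2} + x)$ is $\Theta(x^2)$. The sketch below evaluates the ratio for shrinking $x$; the names `h2` and `scaled_gap` are chosen here for illustration.

```python
import math


def h2(u):
    """Binary entropy in bits."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log2(u) - (1.0 - u) * math.log2(1.0 - u)


def scaled_gap(x):
    """Ratio (1 - h2(1/2 + x)) / x^2, whose limit as x -> 0 decides the scaling order."""
    return (1.0 - h2(0.5 + x)) / x ** 2


ratios = [scaled_gap(x) for x in (1e-1, 1e-2, 1e-3)]
limit = 2.0 / math.log(2.0)   # finite and non-zero, hence Theta(x^2)
```

The ratios approach a finite, non-zero constant, which is exactly the condition for $\Theta$ scaling in the list above.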

A. Low-Noise Lower Bound 

is sufficient for this purpose. The first term,
given in Eq. (12), can be expressed as follows:

iT(r„|r„_i,x„_2,z„_2) 

= \h^ (i - ip..) 

+ ^2 ~ (^2 ~ ^ 

(4 (2 ~ 2^*”^ ^ ( Pid )^ 

= ^ + 0 {p^,) 

since $h_2(\tfrac{1}{2} + x) = 1 + O(x^2)$. The second term, given in
Eq. (13), can be shown to be more significant than $O(p_{id}^2)$.
Therefore, we leave the whole term as it is, and obtain the
following:


for any $x_1, x_2 \in \{0, 1\}$, leading to $p_2 = p_3$. With the fact that
$p_1 + p_2 + p_3 + p_4 = 1$, we are left with 2 free variables $p_1$
and $p_2$ such that $p_4 = 1 - p_1 - 2p_2$. Then:


2 ptPd 


1 - 


Pi +Pd 
Eor the term lim„_>.oo 


Pi + 1 - 


4:PiPd 

Pi+Pd 


P2 


2 piPd 

Pi+Pd 


H in Eq. (13), we have 


$\alpha = P(X_n \ne X_{n-1}) = 1 - p_1 - p_2$. Let $\delta_1 = p_1 - \tfrac{1}{4}$,
$\delta_2 = p_2 - \tfrac{1}{4}$.

f { 61 , 62 ,p^d) = H{Yr,\Yn-l)- lim ij(y„|r 


n —1 




The non-negativity condition of the input distribution (i.e.
$p_1 \ge 0$, $p_2 \ge 0$, $p_4 = 1 - p_1 - 2p_2 \ge 0$) translates to

$$\frac{1}{4} \ge \delta_1 + 2\delta_2, \quad \delta_1 \ge -\frac{1}{4}, \quad \delta_2 \ge -\frac{1}{4} \qquad (21)$$
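The parametrization behind constraint (21) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it takes the i.u.d. point to be $p_1 = p_2 = 1/4$ (two of the eight equiprobable 3-bit strings fall in each class), builds the bit-symmetric distribution of $(X_{n-2}, X_{n-1}, X_n)$ with $p_3 = p_2$, and compares the two overlapping 2-bit marginals. The function names are hypothetical.

```python
from itertools import product


def three_bit_pmf(delta1, delta2):
    """Bit-symmetric pmf of (x_{n-2}, x_{n-1}, x_n) with p1 = 1/4 + delta1, p2 = p3 = 1/4 + delta2."""
    p1 = 0.25 + delta1
    p2 = 0.25 + delta2
    p3 = p2                      # forced by stationarity
    p4 = 1.0 - p1 - 2.0 * p2
    pmf = {}
    for a, b, c in product((0, 1), repeat=3):
        if a == b == c:
            mass = p1            # x_n = x_{n-1} = x_{n-2}
        elif b == c:
            mass = p2            # x_n = x_{n-1} != x_{n-2}
        elif a == b:
            mass = p3            # x_n != x_{n-1} = x_{n-2}
        else:
            mass = p4            # x_n = x_{n-2} != x_{n-1}
        pmf[(a, b, c)] = mass / 2.0
    return pmf


def satisfies_constraint_21(delta1, delta2):
    """Non-negativity of p1, p2 and p4 expressed in the deviations."""
    return delta1 + 2.0 * delta2 <= 0.25 and delta1 >= -0.25 and delta2 >= -0.25


def pair_marginals_match(pmf, tol=1e-12):
    """Check P(x_{n-2}, x_{n-1}) = P(x_{n-1}, x_n) for all bit pairs."""
    for u, v in product((0, 1), repeat=2):
        left = sum(p for (a, b, c), p in pmf.items() if (a, b) == (u, v))
        right = sum(p for (a, b, c), p in pmf.items() if (b, c) == (u, v))
        if abs(left - right) > tol:
            return False
    return True


example = three_bit_pmf(0.10, -0.05)
```

The extreme point $\delta_1 = 0.75$, $\delta_2 = -0.25$ lies on the boundary of (21) and puts all mass on the all-zeros and all-ones strings.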

The Taylor expansion theorem with Lagrange remainder applied
to $f$ about the i.u.d. input (i.e. $\delta_1 = \delta_2 = 0$) with
$p_{id}$ fixed as a parameter gives Eq. (22) for some $c \in [0, 1]$.
Let $\delta_1^*$ and $\delta_2^*$ be values that correspond to a
capacity-achieving input. As an upper bound on the capacity
$C = C(p_{id})$ at $p_{id}$,

C>CZd = ^- 


E 

.fc=i 


2^+1 


R{Pid',k) 


o{pI 


id) 


where 


R [Pid] = ^2 ( ^ + ^ (1 - ‘^puf 


( 20 ) 


B. Low-Noise Upper Bound 

A suitable upper bound not only is sufficiently sim¬ 
ple to analyze, but also well retains its second term 
lim„_>oo H so that it can match up with the 

summand in Eq. (20). We now develop an upper bound with 
lim„^ooi? < iT(y„|y„-i), which is Eq. (17) 

with $\ell = 2$ (which should suffice since the first term on
the left-hand side of this inequality decreases slowly for low
$p_{id}$, as suggested by Fig. 2), while leaving the second term
unchanged as before.

Let

$$p_1 = P(X_n = X_{n-1} = X_{n-2})$$
$$p_2 = P(X_n = X_{n-1} \ne X_{n-2})$$
$$p_3 = P(X_n \ne X_{n-1} = X_{n-2})$$
$$p_4 = P(X_n = X_{n-2} \ne X_{n-1})$$

The bit-symmetry condition gives us

$$P(X_{n-2}^n = 000) = P(X_{n-2}^n = 111) = p_1/2$$
$$P(X_{n-2}^n = 100) = P(X_{n-2}^n = 011) = p_2/2$$
$$P(X_{n-2}^n = 110) = P(X_{n-2}^n = 001) = p_3/2$$
$$P(X_{n-2}^n = 010) = P(X_{n-2}^n = 101) = p_4/2$$

The stationarity condition, as in Lemma 17, requires that

$$P(X_{n-1}^n = x_1 x_2) = P(X_{n-2}^{n-1} = x_1 x_2)$$


$$C(p_{id}) \le f(\delta_1^*, \delta_2^*, p_{id})$$

Observe that $f(0, 0, p_{id})$ matches up with the lower bound in Eq. (20) up to $O(p_{id}^2)$,
which is expected for the zero-order term of a good upper
bound expanded about the i.u.d. input. Therefore we only need
to estimate how the rest of the terms, which represent $\Delta f$ in
the opening discussion of this section, scale with $p_{id}$, without
knowing their exact expression in terms of $p_{id}$.

Allowing approximations to reduce unnecessarily complex
algebra and noticing that $\delta_1^*$ and $\delta_2^*$ are functions of $p_{id}$, we
have:

$$A_1(p_{id}) \approx A_2(p_{id}) = A(p_{id})$$

$$B_{2,0}(c\delta_1^*, c\delta_2^*, p_{id}) \approx B_{0,2}(c\delta_1^*, c\delta_2^*, p_{id}) \approx B_{1,1}(c\delta_1^*, c\delta_2^*, p_{id}) = B(p_{id})$$


One can show that $A(p_{id}) = \Theta(p_{id})$ with elementary calculations.
The analysis of $B(p_{id})$ requires more care, since
it involves $\delta_1^*$ and $\delta_2^*$. Let $b_1(p_{id})$ denote the first term in
$B(p_{id})$, i.e.


bl {Ptd) = -r 


2(1- Pzd) (1 - ‘^Pid) 


1 - ( \pid + r" 


In 2 


where $r^* = 2(1 - p_{id}) c\delta_1^* + 2(1 - 2p_{id}) c\delta_2^*$, and $b_2(p_{id}) =
B(p_{id}) - b_1(p_{id})$. For $\delta_1^*$ and $\delta_2^*$ under the constraint (21) and
$p_{id} < 0.5$, noticing that $\delta_1^*$ and $\delta_2^*$ approach 0 as $p_{id} \to 0$,
one can see easily that $b_1(p_{id}) > 0$ and $b_1(p_{id}) = \Theta(1)$.
Also, $b_1(p_{id})$ is undefined if and only if $c = 1$, $\delta_1^* = 0.75$ and
$\delta_2^* = -0.25$, in which case the capacity-achieving input only
allows the all-zeros and all-ones input strings and hence the
capacity is trivially 0. However, the capacity is always bounded away from 0
for any $p_{id}$. We therefore exclude such large deviations from
0 of $\delta_1^*$ and $\delta_2^*$.
$$f(\delta_1, \delta_2, p_{id}) = f(0, 0, p_{id}) + A_1(p_{id})\delta_1 + A_2(p_{id})\delta_2 - B_{2,0}(c\delta_1, c\delta_2, p_{id})\delta_1^2 - 2B_{1,1}(c\delta_1, c\delta_2, p_{id})\delta_1\delta_2 - B_{0,2}(c\delta_1, c\delta_2, p_{id})\delta_2^2 \qquad (22)$$


f {0,0, P^d) = 1 - 


1 




.fc=l 


C? {pI 


Ai {p^d) = (1 - P^d) log ^ ^ {P^d-, k) 

k=l ^ 

A 2 {Pzd) = (1 - 2 p*d) log ^- 7 ^ - {P^d.; k) 


2 + Prd 2'= 


- 62,0 (ci5i, c(52,Pid) = 


2(1-Kd) 


+ i ^ ^ (g - 1)^ + 2g - 3 ) A: + 2 i? [pid] k) 


1 - ( \p^d + r 


In 2 


fe=l 


Bo, 2 {cSi,cS 2 ,Ptd) = J -— ^--51 -^ (<? - 1)^ A:^ + + 29 - 3 ) A: + 2 R{pid]k) 


1 


1 - 1 2^^*^ + ^ 


-Bi,i (c(5i,c(52,p*d) = 


In 2 '==1 

2 (1-Pid) (1 - 2pid) 1 


1 - ( ^Pid + r 


In 2 




+ - ^ g'" ^ (g - 1)^ A:^ + + 2g - 3 ) A: + 2 R {pid] k) 


$$r = 2(1 - p_{id}) c\delta_1 + 2(1 - 2p_{id}) c\delta_2$$

$$q = \frac{1}{2} + c\delta_1 + c\delta_2$$


To evaluate $b_2(p_{id})$, we first note that $q^* = \tfrac{1}{2} + c\delta_1^* + c\delta_2^* \in
[0, 1]$ under the constraint (21). $q^* = 1$ if and only if $c = 1$,
$\delta_1^* = 0.75$ and $\delta_2^* = -0.25$. So for the same reason above, we
can bound $q^* \le \epsilon < 1$. In addition, since $q^*$ is finite, there
exists a polynomial $P_2(k)$ of degree 2 over the real numbers
such that $P_2(k) \ge (q^* - 1)^2 k^2 + (q^{*2} + 2q^* - 3) k + 2$ for
any $k \ge 1$. Letting $M_k = \epsilon^{k-3} P_2(k)$, we then have $M_k \ge
\left| q^{*k-3} \left[ (q^* - 1)^2 k^2 + (q^{*2} + 2q^* - 3) k + 2 \right] R(p_{id}; k) \right|$ for
any $p_{id}$, since $|R(p_{id}; k)| \le 1$. We also have that $\sum_{k=1}^{\infty} M_k$
converges as a straightforward application of the ratio test for
series convergence. Then by the Weierstrass M-test, we have
that $b_2(p_{id})$ converges uniformly. Consequently,


lim 62 {pid) 

Pid^O 


= 0 



q*^ ^R{pid-,k)x 

(g* - 1)^ A;2 + (g*2 + 2g* - 3) A: + 2 


That is, $b_2(p_{id}) = O(1)$ (and in fact, of order of magnitude
much less than $\Theta(1)$).

As a result, $B(p_{id}) = \Theta(1)$ for $p_{id} < 0.5$. Since $b_2(p_{id}) \to 0$
while $b_1(p_{id}) > 0$ is $\Theta(1)$, we can conclude that $B(p_{id}) > 0$
at sufficiently low $p_{id}$. In fact, Fig. 9 suggests that $B(p_{id})$
is potentially non-positive only in extreme cases where $p_{id}$ is
close to 0.5, $\delta_1^*$ is close to 0.75 and $\delta_2^*$ is close to $-0.25$. For
the purpose of obtaining low-noise approximations, we can
therefore say that $B(p_{id}) > 0$. We then have:


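$$\begin{aligned}
C(p_{id}) - f(0, 0, p_{id}) &\le f(\delta_1^*, \delta_2^*, p_{id}) - f(0, 0, p_{id}) \\
&\approx (\delta_1^* + \delta_2^*) A(p_{id}) - (\delta_1^* + \delta_2^*)^2 B(p_{id}) \\
&\le \sup_{t \in \mathbb{R}} \left[ t A(p_{id}) - t^2 B(p_{id}) \right] \\
&= \frac{A^2(p_{id})}{4 B(p_{id})} = O(p_{id}^2)
\end{aligned} \qquad (23)$$

using the identity $\sup_{t \in \mathbb{R}} (at^2 + bt + c) = (4ac - b^2) / (4a)$
for $a < 0$. This completes the derivation of the low-noise
upper bound.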
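The quadratic identity used in the last step can be checked against a brute-force search. This is a minimal sketch with hypothetical helper names; `low_noise_slack` specializes the identity to $a = -B$, $b = A$, $c = 0$, which gives $A^2 / (4B)$.

```python
def sup_quadratic(a, b, c):
    """Closed-form supremum of a*t^2 + b*t + c over real t, valid for a < 0."""
    if a >= 0:
        raise ValueError("the supremum is finite only for a < 0")
    return (4.0 * a * c - b * b) / (4.0 * a)


def low_noise_slack(A, B):
    """sup over t of [t*A - t^2*B] for B > 0, i.e. A^2 / (4B)."""
    return sup_quadratic(-B, A, 0.0)


def grid_sup(a, b, c, lo=-10.0, hi=10.0, steps=200001):
    """Brute-force maximum of the quadratic on a uniform grid, for comparison only."""
    best = float("-inf")
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        best = max(best, a * t * t + b * t + c)
    return best


closed_form = sup_quadratic(-2.0, 3.0, 1.0)
brute_force = grid_sup(-2.0, 3.0, 1.0)
```

With $A$ proportional to $p_{id}$ and $B$ of constant order, the slack scales as $p_{id}^2$, matching the order claimed in Eq. (23).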

The same conclusion, $C(p_{id}) \le f(0, 0, p_{id}) + O(p_{id}^2)$, can
be reached without the approximations by doing some tedious
algebra. Approximations do not affect our result anyway, since
in this analysis only the order of magnitude of $A(p_{id})$ and
$B(p_{id})$ matters. We also see that $\Delta f$ in the opening discussion
corresponds to $O(p_{id}^2)$, which tends to 0 as $p_{id} \to 0$. This
concurs with the discussed observation from Fig. 8.

As a note, it is clear why the expansion of $f$ is up to
the second order. The first-order expansion would make the
supremum in our final step unbounded, while any higher-order
expansions would complicate the analysis.
Fig. 9. Sign of $B(p_{id})$ for different values of $p_{id}$, $\delta_1^*$ and $\delta_2^*$ which
satisfy the constraint (21). The summands of $b_2(p_{id})$ are limited to a
maximum $k$ of 2000 for computational feasibility. In general, the “non-positive”
region shrinks as the maximum $k$ is increased. Here $c$ is set to
1. However this particular choice of $c$ is irrelevant, since values of $B(p_{id})$
with $(c, \delta_1^*, \delta_2^*) = (c', d_1, d_2)$, where $c' < 1$, are equal to those with
$(c, \delta_1^*, \delta_2^*) = (1, c'd_1, c'd_2)$.



Fig. 10. The DID channel capacity at low noise. 


C. Low-Noise Characterization of the Capacity 
Putting Eq. (20) and (23) together, 


C = l- 


I] k) 

_A :=1 


+ ^ {Pid) 


This characterization, ignoring the $O(\cdot)$ term, is plotted in Fig. 10.

It can be easily shown that $R(p_{id}; k)$ is concave in $p_{id}$
for every $k \ge 1$ and $p_{id} \in (0, 0.5)$. Therefore this analysis
suggests that the capacity is a convex function of the channel
parameter $p_{id}$ in the low-noise regime.


VII. Concluding Remarks 

This work has illustrated the use of the stationarity condition 
of the supremizing input set in the capacity formula in a 
case study of the DID channel model. Input stationarity has 
been identified in both ergodic-theoretic and Shannon-theoretic 
capacity formulations. Evidently this condition is pivotal to 
many of our results. Without it, the new upper bounds could 
become trivial as discussed in Section V-C. Moreover this
condition helps describe finite-dimensional marginal distributions
$P_{X_{n+a}^{n+b}}$, with $a < b$, for all $n$ in only a finite number
of variables, which is key to establishing Lemma 15 and
subsequently the low-noise characterization.

Our simultaneous treatment of the two separate, seemingly 
unrelated frameworks is necessary. Input stationarity arises 
naturally in the ergodic-theoretic framework. The reason is that 
this framework allows only infinitely long input sequences, 
an example of whose source is stationary inputs. On the 
contrary, in the Shannon-theoretic framework, allowable input
sequences have finite (block) lengths, and only when reliable
communication is of concern (i.e. the error probability is
driven to zero) are the block lengths increased to infinity. This
setting makes it harder to see whether stationary inputs and
the like should play any role in capacity formulations. As
witnessed in Section III-D, via a connection to results in the 
ergodic-theoretic framework, developments to a formulation 
with stationary inputs could be realized. With further scrutiny, 
one sees that the difference between the two frameworks 
lies mainly in their operational structures, whereas the said 
connection is purely on information-theoretic quantities, which 
are dictated by the joint input-output distribution, shared by 
both frameworks. Curiously the Shannon-theoretic framework 
is not the only beneficiary. An argument in Appendix A shows 
that so is the ergodic-theoretic framework. 

Next we discuss a few relevant directions concerning the 
capacity evaluation techniques for further investigations. 

a) Channels with High Local Dependency: The DID
channel output has relatively low local dependency: $Y_i$ is
determined only through the present and the immediate past,
$\{Z_i, X_{i-1}^{i}\}$. This may give an intuitive explanation of why
the upper bounds in Fig. 5 converge very quickly at low $\ell$.
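To make the local dependency concrete, the following is a minimal simulation sketch, not the authors' code. It assumes the output rule $Y_i = X_{i - Z_i}$ and a two-state Markov chain $\{Z_i\}$ with $P(Z_i = 1 \mid Z_{i-1} = 0) = p_i$ and $P(Z_i = 0 \mid Z_{i-1} = 1) = p_d$; this transition convention, and the names `simulate_did`, `z0` and `seed`, are assumptions made for illustration.

```python
import random


def simulate_did(x, p_i, p_d, z0=0, seed=0):
    """Pass input bits x[0..n] through the assumed DID rule y_i = x[i - z_i], i = 1..n."""
    rng = random.Random(seed)
    z_prev = z0
    y, z = [], []
    for i in range(1, len(x)):
        if z_prev == 0:
            z_i = 1 if rng.random() < p_i else 0   # enter the shifted state
        else:
            z_i = 0 if rng.random() < p_d else 1   # leave the shifted state
        y.append(x[i - z_i])                        # only x[i] or x[i-1] can be written
        z.append(z_i)
        z_prev = z_i
    return y, z


x = [0, 1, 1, 0, 1, 0, 0, 1]
clean_y, clean_z = simulate_did(x, p_i=0.0, p_d=0.3)
noisy_y, noisy_z = simulate_did(x, p_i=0.4, p_d=0.4, seed=3)
```

Each output bit depends on the current state and at most two input bits, which is the sense in which the local dependency is low.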

We however cannot expect this to be the case for other 
practical channels, e.g. the model in [10]. As mentioned in 
Section I-C, the DID channel only captures partially key 
features of the BPMR write process. For example, it can 
be observed that in the DID channel model, an output sub¬ 
sequence either remains synchronized with its corresponding 
input sub-sequence or leads it by 1 bit. The scenario in which 
the output sub-sequence lags behind the input sub-sequence by 
1 bit is therefore missing. This was considered by the model in 
[10], which sets the alphabet of $Z_i$ to $\mathcal{Z} = \{-1, 0, 1\}$. While
such $\mathcal{Z}$ makes the channel non-causal, one can redefine the
input to be $\tilde{X}_i = X_{i-1}$ and consequently $Y_i = \tilde{X}_{i+1-Z_i}$,
which is a mathematically causal channel and to which Eq.
(8) is again applicable. The scenario where misalignment by
more than 1 bit is allowed was discussed in [8, Section VI],
in which case $\mathcal{Z}$ also has larger sizes.

All those considerations lead to higher local dependency,
which gives rise to numerous difficulties. At first glance, the
convergence would be slower. To resolve this, one may increase
$\ell$ (or some similar parameters that control the bounds)
to obtain the desired accuracy to the true capacity. However the
larger $\ell$ is, the higher the computational complexity. The intrinsic
tradeoff between accuracy and complexity is therefore more
stringent in this case, which may call for a modification or a
different formulation of the upper bound.

b) Channels with Substitution Errors: Substitution errors
are usually inevitable in practical systems. For example, it
was noted in [10] that insertion and deletion errors, underlain
by the Markov channel state process, are not sufficient to
faithfully describe all imperfections that arise in the context
of BPMR. Their model considered burst substitution errors
that accompany insertion and deletion events, in addition to
random substitution errors caused by a localized random phase
drift between the desired and actual window to write each bit.

In general, we may model substitution errors by $Y_i =
X_{i-Z_i} \oplus B_i$, where $B_i$ is a binary random variable representing
the additive noise, independent of the input. A
straightforward approach is to take into account $B_i$ and do
the exact calculation. Another way is to view the model as
two subsystems, in which one is $A_i = X_{i-Z_i}$, concatenated
with the other $Y_i = A_i \oplus B_i$. Then an upper bound
on the capacity is given by the data processing inequality,
$I(X; Y) \le \min\{I(X; A), I(A; Y)\}$, suggesting that $C \le
\min\{C_1, C_2\}$, where $C$, $C_1$ and $C_2$ are the true capacity, the
first subsystem’s capacity and that of the second subsystem
respectively. This gives a simple benchmark upper bound.
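The benchmark bound can be sketched numerically under one extra assumption that the text does not make: if the $B_i$ are i.i.d. Bernoulli($p_b$), the second subsystem is a binary symmetric channel with capacity $C_2 = 1 - h_2(p_b)$. The value passed as `c1` below is a placeholder for whatever bound on the first subsystem is available, not a result from this work.

```python
import math


def h2(u):
    """Binary entropy in bits."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log2(u) - (1.0 - u) * math.log2(1.0 - u)


def bsc_capacity(p_b):
    """Capacity of a binary symmetric channel with crossover probability p_b."""
    return 1.0 - h2(p_b)


def benchmark_upper_bound(c1, p_b):
    """min{C1, C2} from the data processing inequality, with C2 the BSC capacity."""
    return min(c1, bsc_capacity(p_b))


example_bound = benchmark_upper_bound(c1=0.9, p_b=0.11)   # c1 = 0.9 is a hypothetical value
```

When the substitution noise is strong, the second subsystem dominates the bound; when $p_b = 0$ the bound falls back to $C_1$.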

c) Finite-Block-Length Regime: One natural question is
how the fundamental limit behaves when the block length
is finite (see e.g. [24], [25]). In this analysis, the capacity
is known as the first-order coding rate, and the quest is to
find higher-order coding rates given a finite block length $n$
and some non-zero tolerable error probability $\epsilon$. Note that
the finite-block-length analysis is usually developed under the
Shannon-theoretic framework. It is unclear how the ergodic-theoretic
channel definition could extend itself to encompass
finite block lengths.

While finite-lettered characterizations of the second-order
coding rate have been determined precisely for certain channels,
they are not known in general for channels with memory
or complex structures. Since the DID channel’s first-order
coding rate has been determined approximately under the
Shannon-theoretic framework as discussed above, it would be
interesting to see how one can approximate its second-order
coding rate as a series expansion in the channel parameters.

Appendix A 

We consider the ergodic-theoretic “capacity” formulation
with general initializations of the channel state. The DID channel
is an FSC in the Shannon-theoretic framework. It was
pointed out by Kieffer and Rahe [26] (and also by Gray et
al. [27]) that FSCs can be naturally described as a special
case of (one-sided) Markov channels in the ergodic-theoretic
setting. Furthermore they showed that Markov channels are
asymptotically mean stationary (AMS). Hence so is the DID
channel.


For any $p_i$ and $p_d$ in $(0, 1)$, the process $\{Z_i\}$ is mixing.

As such, the DID channel is output strongly mixing, in light 
of Section II-A. Since the channel is (one-sided) Markov, by 
[27, Theorem 2] and [27, Lemma 3], it is ergodic. 

By [5, Theorem 12.6.1], the following rate is achievable: 

C*{Pz)= sup hm (24) 

AMS n 

Here the subscript AMS means that it is calculated w.r.t. the
stationary mean of the joint input-output distribution, and the
notation $C^*(P_Z)$ emphasizes the dependence of the rate on
the distribution $P_Z$ of the process $\{Z_i\}_{i=1}^{\infty}$.

All that is left is to establish an equality between
Eq. (6) (which assumes $P(Z_i = 0) = p_d / (p_i + p_d)$ and
$P(Z_i = 1) = p_i / (p_i + p_d)$) and Eq. (24). We digress from
the task temporarily. The notion of AMS processes does
not closely describe the processes associated with the DID
channel. We hence appeal to the following definition.

Definition 18. A random process, defined over
a probability space $(\Omega, \mathcal{F}, P)$, is asymptotically stationary
(AS) if $\forall E \in \mathcal{F}$, the limit $P^{\infty}(E) = \lim_{n \to \infty} P(T^{-n}E)$
exists. $P^{\infty}$ is then called the asymptotic stationary probability
measure.

One can easily prove that a stationary measure is AS and
AMS. In addition, we have the following lemma.

Lemma 19. An AS process is AMS. Furthermore, the stationary
mean is its asymptotic stationary probability measure.

Proof: The claim follows easily from the definitions and
the Cesàro mean theorem. ■
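Lemma 19 can be illustrated on the state process itself. The sketch below assumes a two-state Markov chain with $P(Z_n = 1 \mid Z_{n-1} = 0) = p_i$ and $P(Z_n = 0 \mid Z_{n-1} = 1) = p_d$ (the transition convention is an assumption; the stationary probabilities $p_d / (p_i + p_d)$ and $p_i / (p_i + p_d)$ are those quoted in the text). Starting from an arbitrary initialization, it tracks $P(Z_n = 1)$ and its Cesàro average.

```python
def marginal_sequence(q0, p_i, p_d, steps):
    """P(Z_n = 1) for n = 0..steps, starting from P(Z_0 = 1) = q0."""
    qs = [q0]
    for _ in range(steps):
        q = qs[-1]
        qs.append(q * (1.0 - p_d) + (1.0 - q) * p_i)
    return qs


def cesaro_mean(values):
    """Arithmetic mean of the sequence, as in the Cesaro mean theorem."""
    return sum(values) / len(values)


p_i, p_d = 0.1, 0.2
qs = marginal_sequence(1.0, p_i, p_d, 20000)   # deliberately non-stationary start
stationary = p_i / (p_i + p_d)
average = cesaro_mean(qs)
```

The marginal converges (the AS property for this one-dimensional event), and the running average converges to the same value, which is the content of the lemma for this event.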

Consider two independent processes $\{A_n\}_{n=1}^{\infty}$ and
$\{B_n\}_{n=1}^{\infty}$ whose underlying probability measures are
respectively $P_A$ and $P_B$. Let $P_{AB}$ denote the probability
measure of the joint process $\{(A_n, B_n)\}_{n=1}^{\infty}$. For any event
$E$ on this joint process, we have:

Pab (T-^E) = j Y. ([a', a], [b', bj) 

(a,b)eB 

= I ^ d(P4xPB)([a',a],[b',b]) 

(a.b)eB 

= I d (Pa xPb)(T-" {a}, r-"{b}) 

(a,b)GB 

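This shows that if $\{A_n\}_{n=1}^{\infty}$ and $\{B_n\}_{n=1}^{\infty}$ are AS,
$\{(A_n, B_n)\}_{n=1}^{\infty}$ is also AS. Moreover $P_{AB}^{\infty}$ is determined
completely by $P_A^{\infty}$ and $P_B^{\infty}$, which denote the associated
asymptotic stationary probability measures.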

Next, consider a process $\{V_n\}_{n=1}^{\infty}$, defined under $P_V$.
Define another process $\{W_n : W_n = \phi(V_n, V_{n-1}, \ldots, V_{n-k})\}$,
where $\phi$ is a deterministic and time-invariant function and
$k$ is finite. Let $P_W$ be the underlying probability measure
of $\{W_n\}_{n=k+1}^{\infty}$, and consider an event $E_W$ on the process.
Let £i = {ui : u G Ew}- Let = (w : Wi G £i} and 
Ey = {v : ^ S fi}. Then for any j > 0:

/ oo \ 

Pw (T-^Ew) = Pw T-^ n 




= Pw i n {w : w,+j G Si} I 

\i-/c+l / 

/ oo ^ 

V n {^- ^{<Vj-k) ef*} 


= P 




=Pv T-^ n 


i—k-\-l 


where the third equality is because firstly, any v G 
n~fc+i {v : 4 > under (j), is transformed into 

some w G Htfe+i ■ Wi+j G Si], and secondly, for any 
V ^ rifefe+i 1''^ • 4 > G Sj^, there exists at least one 

index r > fc + 1 such that (p ^ Sr and so v is not 

transformed into any member of H^fc+i ■ Wi+j G Si}. 
This shows that if $\{V_n\}_{n=1}^{\infty}$ is AS, so is $\{W_n\}_{n=k+1}^{\infty}$. Moreover
$P_W^{\infty}$ is completely determined by $P_V^{\infty}$.

We return to our original problem. Consider AMS $P_{X^{\infty}}$,
since Eq. (24) involves only AMS inputs. It is known that
$\{Z_i\}_{i=1}^{\infty}$ is AS for any $p_i, p_d \in (0, 1)$; moreover it has a
unique $P_Z^{\infty}$ which coincides with its distribution when
$P(Z_i = 0) = p_d / (p_i + p_d)$ and $P(Z_i = 1) = p_i / (p_i + p_d)$,
i.e. when the initialization allows it to be stationary, for the
same $p_i$ and $p_d$. Notice that for the DID channel, we can
express {Xi,Yi) as deterministic and time-invariant functions 
of Let 

and {Xi, Yi) play the role of Ai, Bi, Vi and Wi, respectively, 
in the above discussion. We see that the joint input-output 
distribution is AS. By Lemma 19, this implies that C* (Pz) 
can be computed w.r.t. Px°°x- Furthermore Px°°x L com¬ 
pletely determined by the input distribution, which is AMS, 
and P^, which is the same as P^. Therefore, instead of using 
the channel law in Eq. (5) which asserts 


Lemma 20. The SE capacity of a consistent SE channel with 
finite alphabets is given by 


CsE 


sup 


lim -/ 

n—>-oo 77, 




1 -A_ i 



Proof: Given any SE input, a single joint distribution 
Px°° Y exists and is SE. With an abuse of notation, we 
therefore drop all subscripts and use p to denote the respective 
probability measure. Let Wn = (Pn, ; then {Wn}^^i 

is SE. 


n 


(^n+A+;y, 


= ^ logp (IE") - - logp ^ logp (E") 

lim -H{W^)+ lim -P 

n^oo ri n—>-oo 77 \ l a_ y 


-f lim -P(y") 

n—foc) 77 

= lim 

n^oo n V / 


where the convergence is almost-sure in Px“ ^ y. by invok¬ 
ing the Shannon-McMillan-Breiman theorem. Since I(X; Y) 
is the left end point of the support of the distribution of 
^ at the limit n ^ oo, this convergence 

implies the claim. ■ 

The main proof is a modification of Feinstein’s work [15],
which is under the ergodic-theoretic framework. We exploit
his construction of an SE probability measure from a finite-dimensional
probability measure. For some fixed $r \in \mathbb{N}^+$ and
$s = r + \lambda_+ + \lambda_-$, let us consider an arbitrary probability
measure $\mu^{(r)}$ of $X_{1-\lambda_-}^{r+\lambda_+}$. Define a probability measure $\mu$ such
that for any two integers $k_1$ and $k_2$ where $0 \le k_1 < k_2$,




/ -y'k 2 S — A_ 
[^kis+l-X_ 


= X 


k 2 S \ 
fciS+1 J 


k2 — l 

pL) (x 


(fc-|-l)s\ 
fcs+1 J 


k—ki 


Py|;,c. (Py|x) = Pz(£:(x,Pk)) 

we can compute C* (Pz) as if the channel law is 

Py|;,c. (Py|x)=P*(£:(x,P,.)) 

i.e. C* {Pz) = C* (P|). When the channel assumes P|, it is 
stationary as shown in Section II-A. By [5, Lemma 12.4.2], 

C*{Pi)= sup lim -/(Yo";Yi") 

Stationary tl 

whose right-hand side coincides with that of Eq. (6). This 
completes our argument. 

Appendix B 

We prove Theorem 11. As a reminder, this proof is under 
the Shannon-theoretic framework. 


It is easy to see that $\mu$ is $s$-stationary. Let us define another
probability measure $\bar{\mu}$ such that for any event $E$ on the input:

$$\bar{\mu}(E) = \frac{1}{s} \left[ \mu_0(E) + \mu_1(E) + \ldots + \mu_{s-1}(E) \right]$$

where $\mu_k$ is defined by $\mu_k(E) = \mu(T^{-k}E)$ $\forall E$, for $k =
0, \ldots, s - 1$. It can be easily established in a similar manner
to [15] that $\bar{\mu}$ is SE. Then we immediately see that $\mu$ is ergodic,
since for any invariant event $E$, $\mu(E) = \bar{\mu}(E)$, which is equal
to either 0 or 1.

We shall need the following lemma. 

Lemma 21. For any four random variables $X_1$, $X_2$, $Y_1$ and
$Y_2$, in which $X_1$ is independent of $X_2$, we have:

$$I(X_1, X_2; Y_1, Y_2) \ge I(X_1; Y_1) + I(X_2; Y_2)$$

Proof: This lemma can be found in [28, Problem 2.4(e)].
We provide a proof here for completeness.

$$\begin{aligned}
I(X_1, X_2; Y_1, Y_2) &= I(X_1; Y_1, Y_2) + I(X_2; Y_1, Y_2 \mid X_1) \\
&= I(X_1; Y_1, Y_2) + I(X_2; Y_1, Y_2, X_1) - I(X_1; X_2) \\
&= I(X_1; Y_1, Y_2) + I(X_2; Y_1, Y_2, X_1) \\
&\ge I(X_1; Y_1) + I(X_2; Y_2)
\end{aligned}$$

where $I(X_1; X_2) = 0$ since $X_1$ and $X_2$ are independent. ■
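Lemma 21 can also be spot-checked numerically. The sketch below draws random joint distributions over binary $(X_1, X_2, Y_1, Y_2)$ in which $X_1$ and $X_2$ are independent and $(Y_1, Y_2)$ is an arbitrary function of the pair plus noise, then evaluates both sides of the inequality. It is only a sanity check on small alphabets, and the helper names are hypothetical.

```python
import itertools
import math
import random


def mutual_information(joint, a_idx, b_idx):
    """I(A;B) in bits, where A and B select coordinates of each outcome tuple."""
    pa, pb, pab = {}, {}, {}
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
        pab[(a, b)] = pab.get((a, b), 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0.0)


def random_joint(rng):
    """Random pmf over (x1, x2, y1, y2) with X1 independent of X2."""
    q1, q2 = rng.random(), rng.random()
    joint = {}
    for x1, x2 in itertools.product((0, 1), repeat=2):
        p_in = (q1 if x1 else 1.0 - q1) * (q2 if x2 else 1.0 - q2)
        weights = [rng.random() for _ in range(4)]   # arbitrary conditional law of (y1, y2)
        total = sum(weights)
        for (y1, y2), w in zip(itertools.product((0, 1), repeat=2), weights):
            joint[(x1, x2, y1, y2)] = p_in * w / total
    return joint


def lemma_gap(joint):
    """I(X1,X2;Y1,Y2) - I(X1;Y1) - I(X2;Y2), which the lemma says is non-negative."""
    lhs = mutual_information(joint, (0, 1), (2, 3))
    rhs = mutual_information(joint, (0,), (2,)) + mutual_information(joint, (1,), (3,))
    return lhs - rhs


rng = random.Random(0)
smallest_gap = min(lemma_gap(random_joint(rng)) for _ in range(200))
```

The smallest observed gap stays non-negative up to floating-point error, as the lemma requires.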
Proof of Theorem 11: Let us consider an arbitrary finite-dimensional
distribution sequence $\{\mu^{(n)}\}_{n=1}^{\infty}$, which should
be understood as an arbitrary input sequence and is not
necessarily consistent, and some $r \in \mathbb{N}^+$. We construct the
aforementioned probability measures $\mu$ and $\bar{\mu}$ from $\mu^{(r)}$. By
passing the input generated by $\mu$ (resp. $\bar{\mu}$) through the channel,
we obtain the joint input-output distribution $\omega$ (resp. $\bar{\omega}$) and
the output distribution $\eta$ (resp. $\bar{\eta}$). A single measure $\omega$ (resp.
$\bar{\omega}$) exists, and consequently $\eta$ (resp. $\bar{\eta}$) exists, since the channel
is consistent. Since the relations are linear,

$$\bar{\omega} = \frac{1}{s} (\omega_0 + \omega_1 + \ldots + \omega_{s-1})$$

$$\bar{\eta} = \frac{1}{s} (\eta_0 + \eta_1 + \ldots + \eta_{s-1})$$

where $\omega_k$ and $\eta_k$ are the distributions corresponding to the
input $\mu_k$, for $k = 0, \ldots, s - 1$. It is easy to see that $\mu_k$ is
$s$-stationary since $\mu$ is $s$-stationary. Therefore $\omega_k$ and $\eta_k$ are
$s$-stationary, since the channel is weakly stationary. Notice the
following:

Wfe =^>^" = y) 

xSA'k ' ^ 

X = [x,x]}) 

= u; (t-'= = X, y" = y}) 


where (a) is because the channel is stationary. This implies
that $\omega_k = \omega T^{-k}$ and consequently $\eta_k = \eta T^{-k}$.

With these facts, similar to [15], one can prove the following
limits exist:

lim — 

n—foo Ti 




lim — 

n—fca Ji 




lim -7Tfl(y") 

n—foo Tl 


lim -H^ 

n—foo Tl 


lim — 

n—foo Ti 


{K-t) 


lim ^P„(y”) 

n—foo Tl 


and consequently, 

lim -/Jx”+^;y") 

n—>-oo Tl \ / 


lim 

n—fca Tl 




" + X+ . XT 
1-A_ J ^ 


where the probability measure subscript in the entropy quantity 
implies that the quantity is calculated w.r.t. that measure, and 
that in the mutual information quantity implies the “rate” 


achieved by using the respective measure as the input dis¬ 
tribution. The superscript (n) is dropped without ambiguity. 

Without loss of generality, let $n = ts - \lambda_+ - \lambda_-$, in which
$t \to \infty$. Notice that, due to the structure of $\mu$, the blocks
$\{X_{(i-1)s+1-\lambda_-}^{is-\lambda_-}\}$ are independent of each other under $\mu$.

Then:


+X+ . 

-A_ ’ ^ 


(a- 

- 1 1 ^s-ri ’ • ■ ■ J J 

) 


1 
n 

(a) 1 
> 

n 

V- 

(b) 

n ^ 

(c) t 


(z—l)s-t-r‘ 


(i-l)s-|-l-A_ ’ (z-l)s-l-l 


> 


j(r) 


Here (a) is by applying repeatedly Lemma 21; (b) is because
the channel is weakly stationary, $\mu$ is $s$-stationary and consequently
$\omega$ is $s$-stationary; (c) is due to the fact that the
distribution of $X_{1-\lambda_-}^{r+\lambda_+}$ under $\mu$ is simply $\mu^{(r)}$ and the channel
is consistent.

Next we indicate the choice of $r$. We have $\forall \epsilon > 0$, there
exists $r = r(\epsilon)$ such that

r -I- A+ -f A_ 
> lim inf 


jir) (^X[+^+;yf) 


N^oo N A_^ + A_ 

lim inf-/W fxf+/+;y^U 
N^oc N V J 


jiN) 


Then: 


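\[
\begin{aligned}
\lim_{n\to\infty} \frac{1}{n} I_{\bar{\mu}}\left(X_{1-\lambda_-}^{n+\lambda_+}; Y^{n}\right)
&\geq \liminf_{N\to\infty} \frac{1}{N} I_{P^{(N)}}\left(X_{1-\lambda_-}^{N+\lambda_+}; Y_1^{N}\right) - \epsilon \\
&\geq \underline{I}(\mathbf{X}; \mathbf{Y}) - \epsilon
\end{aligned}
\]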


where the last inequality is from [1, Theorem 8.h]. Maximizing
over the respective input on each side (i.e. $\mu$ on the left-hand side
and $\mathbf{X}$ on the right-hand side), we obtain:


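\[
\begin{aligned}
C - \epsilon &\leq \sup_{\mu} \lim_{n\to\infty} \frac{1}{n} I_{\bar{\mu}}\left(X_{1-\lambda_-}^{n+\lambda_+}; Y^{n}\right) \\
&\leq \sup_{\text{SE } \mu} \lim_{n\to\infty} \frac{1}{n} I_{\mu}\left(X_{1-\lambda_-}^{n+\lambda_+}; Y^{n}\right) \\
&= C_{SE}
\end{aligned}
\]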


by the fact that $\bar{\mu}$ is SE and Lemma 20. Since $\epsilon$ is arbitrary
and we know that $C \geq C_{SE}$, we conclude $C = C_{SE}$. ■


Appendix C 
A. Proof of Proposition 4 

Consider a stationary input $\mu$. Let $\bar{\mu}$ be such that $\bar{\mu}(E) =
\mu(\neg E)$ for any $E \subset GF(2)^{\infty}$. Let $\mu_0 = (\mu + \bar{\mu})/2$. It is
easy to see that $T^{-1}(\neg E) = \neg(T^{-1}E)$. Then:

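\[
\begin{aligned}
\mu_0\left(T^{-1}E\right) &= \frac{1}{2}\left(\mu\left(T^{-1}E\right) + \bar{\mu}\left(T^{-1}E\right)\right) \\
&= \frac{1}{2}\left(\mu\left(T^{-1}E\right) + \mu\left(T^{-1}(\neg E)\right)\right) \\
&= \frac{1}{2}\left(\mu(E) + \mu(\neg E)\right) = \mu_0(E)
\end{aligned}
\]

That is, $\mu_0$ is stationary. Furthermore,

\[ \mu_0(E) = \frac{1}{2}\left(\mu(E) + \mu(\neg E)\right) = \mu_0(\neg E) \]

and therefore $\mu_0$ is bit-symmetric.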

Since the channel is fixed, let $I_n(\mu)$ denote $I(X_0^n; Y^n)$
when the input admits $\mu$. Let $\omega$, $\eta$, $\bar{\omega}$ and $\bar{\eta}$ be the joint input-output
distributions and the output distributions of $\mu$ and $\bar{\mu}$
respectively. We have:


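\[
\begin{aligned}
\bar{\omega}(E_X, E_Y) &= \int_{E_X} \nu_x(E_Y) \, d\bar{\mu}(x) \\
&= \int_{E_X} \nu_{\neg x}(\neg E_Y) \, d\mu(\neg x) \\
&= \int_{\neg E_X} \nu_x(\neg E_Y) \, d\mu(x) \\
&= \omega(\neg E_X, \neg E_Y)
\end{aligned}
\]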


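Consequently,

\[
\begin{aligned}
\bar{\eta}(E_Y) &= \bar{\omega}\left(GF(2)^{\infty}, E_Y\right) \\
&= \omega\left(\neg GF(2)^{\infty}, \neg E_Y\right) = \eta(\neg E_Y)
\end{aligned}
\]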

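We then have the following:

\[
\begin{aligned}
H_n(\bar{\mu}) &= -\sum_{x \in GF(2)^{n+1}} \bar{\mu}\left(\{X_0^n = x\}\right) \log \bar{\mu}\left(\{X_0^n = x\}\right) \\
&= -\sum_{x \in GF(2)^{n+1}} \mu\left(\{X_0^n = \neg x\}\right) \log \mu\left(\{X_0^n = \neg x\}\right) \\
&= -\sum_{x \in GF(2)^{n+1}} \mu\left(\{X_0^n = x\}\right) \log \mu\left(\{X_0^n = x\}\right) \\
&= H_n(\mu)
\end{aligned}
\]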

and similarly, $H_n(\bar{\omega}) = H_n(\omega)$, $H_n(\bar{\eta}) = H_n(\eta)$.
Consequently, $I_n(\bar{\mu}) = H_n(\bar{\mu}) + H_n(\bar{\eta}) - H_n(\bar{\omega}) = I_n(\mu)$. Given
a fixed channel, it has been shown that $I_n(\mu)$ is a concave
function of $\mu$ [5, Corollary 5.5.5]. Therefore,

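\[ I_n(\mu_0) = I_n\left(\tfrac{1}{2}(\mu + \bar{\mu})\right) \geq \frac{1}{2} I_n(\mu) + \frac{1}{2} I_n(\bar{\mu}) = I_n(\mu) \]

This is true for every $n \geq 1$, which implies

\[ \sup_{\text{Stationary } \mu} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y^n\right) \leq \sup_{\substack{\text{Bit-symmetric,} \\ \text{Stationary } \mu}} \lim_{n\to\infty} \frac{1}{n} I\left(X_0^n; Y^n\right) \]

The proof is complete.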
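
The symmetrization step can be checked numerically on a toy case. The sketch below is an illustration only, not the bit-patterned media recording channel studied in the paper: it assumes a memoryless binary symmetric channel with a hypothetical crossover probability of 0.1 and a hypothetical single-letter input distribution, and the helper name `mutual_information` is introduced here. It computes the mutual information for an input, for its bit-flipped version, and for their equal-weight mixture.

```python
import math

def mutual_information(input_dist, channel):
    """I(X;Y) in bits for a finite input distribution and a row-stochastic
    channel matrix, where channel[x][y] = P(Y = y | X = x)."""
    n_out = len(channel[0])
    out = [sum(input_dist[x] * channel[x][y] for x in range(len(input_dist)))
           for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(input_dist):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0.0:
                total += joint * math.log2(joint / (px * out[y]))
    return total

crossover = 0.1  # hypothetical toy value
bsc = [[1 - crossover, crossover], [crossover, 1 - crossover]]

mu = [0.8, 0.2]                                       # toy input (assumption)
mu_flipped = [mu[1], mu[0]]                           # bit-flipped input
mu_0 = [(a + b) / 2 for a, b in zip(mu, mu_flipped)]  # symmetrized mixture

i_mu = mutual_information(mu, bsc)
i_flipped = mutual_information(mu_flipped, bsc)
i_sym = mutual_information(mu_0, bsc)
```

On this toy channel `i_mu` and `i_flipped` coincide because the channel treats the two symbols symmetrically, and `i_sym` is at least as large as either, in line with the concavity argument.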


B. Proof of Proposition 14 

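Consider a fixed $n \geq 1$. Since the channel is fixed,
let $I(Q^{(n)})$ denote $I\left(X_{1-\lambda_-}^{n+\lambda_+}; Y_1^n\right)$ when $X_{1-\lambda_-}^{n+\lambda_+} \sim Q^{(n)}$.
Define $\bar{Q}^{(n)}$ where $\bar{Q}^{(n)}(x) = Q^{(n)}(\neg x)$ $\forall x \in
GF(2)^{n+\lambda_+ +\lambda_-}$. Similar to the proof of Proposition 4, it
can be proven that $I(\bar{Q}^{(n)}) = I(Q^{(n)})$. Next, construct
an input distribution $P^{(n)}$ on $GF(2)^{n+\lambda_+ +\lambda_-}$ such that $P^{(n)}(x) =
\left(Q^{(n)}(x) + \bar{Q}^{(n)}(x)\right)/2$ $\forall x$. It is easy to see that $P^{(n)}$
satisfies $P^{(n)}(x) = P^{(n)}(\neg x)$ $\forall x$. Given a fixed channel,
$I(P^{(n)})$ is a concave function of $P^{(n)}$ [29, Theorem 2.7.4].
Therefore,

\[ I\left(P^{(n)}\right) = I\left(\tfrac{1}{2}\left(Q^{(n)} + \bar{Q}^{(n)}\right)\right) \geq \frac{1}{2} I\left(Q^{(n)}\right) + \frac{1}{2} I\left(\bar{Q}^{(n)}\right) = I\left(Q^{(n)}\right) \]

This is true for every $n \geq 1$.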
Now for every stationary input process $\mathbb{P}_Q = \left\{Q^{(n)}\right\}_{n=1}^{\infty}$
(with a single underlying probability measure $Q$), if we can find a stationary input
process $\mathbb{P}_P = \left\{P^{(n)}\right\}_{n=1}^{\infty}$ with $P^{(n)}$
defined as above, then the proposition is proven since $P^{(n)}$
exhibits bit-symmetry. That is, we need to justify that $\mathbb{P}_P$ is
consistent and stationary. To show Kolmogorov consistency:


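\[
\begin{aligned}
\sum_{x_{n+1+\lambda_+}} P^{(n+1)}\left(x_{1-\lambda_-}^{n+1+\lambda_+}\right)
&= \frac{1}{2} \sum_{x_{n+1+\lambda_+}} \left( Q^{(n+1)}\left(x_{1-\lambda_-}^{n+1+\lambda_+}\right) + Q^{(n+1)}\left(\neg x_{1-\lambda_-}^{n+1+\lambda_+}\right) \right) \\
&= \frac{1}{2} \left( Q^{(n)}\left(x_{1-\lambda_-}^{n+\lambda_+}\right) + Q^{(n)}\left(\neg x_{1-\lambda_-}^{n+\lambda_+}\right) \right) \\
&= P^{(n)}\left(x_{1-\lambda_-}^{n+\lambda_+}\right)
\end{aligned}
\]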


since $\mathbb{P}_Q$ is consistent. We can then assign a single underlying
probability measure $P$ to $\mathbb{P}_P$. To show stationarity, note that
$P(E) = (Q(E) + Q(\neg E))/2$ $\forall E$ from the relation between
$P^{(n)}$ and $Q^{(n)}$. Then similar to the proof of Proposition 4, we
have $P$ is stationary since $\mathbb{P}_Q$ is stationary.
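
The consistency and stationarity claims for the symmetrized family can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the stationary input process is taken to be a two-state Markov chain started from its stationary distribution, the transition probabilities 0.3 and 0.6 are hypothetical, and the helper names `markov_prob`, `symmetrized_prob` and `max_violation` are introduced here. For each block length it measures how far the symmetrized probabilities are from satisfying right-marginalization (Kolmogorov consistency), left-marginalization (which, together with consistency, gives shift invariance) and bit-symmetry.

```python
from itertools import product

def markov_prob(x, p01, p10):
    """Probability of the binary tuple x under a two-state Markov chain with
    P(1|0) = p01 and P(0|1) = p10, started from its stationary distribution."""
    pi1 = p01 / (p01 + p10)
    trans = {(0, 0): 1 - p01, (0, 1): p01, (1, 0): p10, (1, 1): 1 - p10}
    prob = pi1 if x[0] == 1 else 1 - pi1
    for a, b in zip(x, x[1:]):
        prob *= trans[(a, b)]
    return prob

def flip(x):
    """Bitwise complement of a binary tuple."""
    return tuple(1 - b for b in x)

def symmetrized_prob(x, p01, p10):
    """Equal-weight mixture of the chain's law and its bit-flipped law."""
    return 0.5 * (markov_prob(x, p01, p10) + markov_prob(flip(x), p01, p10))

def max_violation(n, p01, p10):
    """Largest deviation, over all length-n blocks, from right-marginal
    consistency, left-marginal consistency and bit-symmetry."""
    worst = 0.0
    for x in product((0, 1), repeat=n):
        p_n = symmetrized_prob(x, p01, p10)
        right = sum(symmetrized_prob(x + (b,), p01, p10) for b in (0, 1))
        left = sum(symmetrized_prob((b,) + x, p01, p10) for b in (0, 1))
        sym = symmetrized_prob(flip(x), p01, p10)
        worst = max(worst, abs(right - p_n), abs(left - p_n), abs(sym - p_n))
    return worst

violations = [max_violation(n, 0.3, 0.6) for n in range(1, 6)]
```

For this toy chain every entry of `violations` is zero up to floating-point error, matching the claim that symmetrization preserves consistency and stationarity while enforcing bit-symmetry.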